ABSTRACT

This paper presents a novel static analysis technique to detect XML
query-update independence in the presence of a schema. Rather
than types, our system infers chains of types. Each chain represents
a path that can be traversed on a valid document during
query/update evaluation. The resulting independence analysis is
precise, although it raises a challenging issue: recursive schemas
may lead to inference of infinitely many chains.

A sound and complete approximation technique ensuring a finite
analysis in any case is presented, together with an efficient
implementation performing the chain-based analysis in polynomial space
and time.

1. INTRODUCTION

A query and an update are independent when the query result is 
not affected by update execution, on any possible input database. 
Detecting query-update independence is of crucial importance in 
many contexts: i) to minimize view re-materialization; ii) to ensure 
isolation, when queries and updates are executed concurrently; iii) 
as outlined in [6], to enforce access control policies, when the query 
is used to define the part of the database that must not be changed 
by a user update. 

In all these contexts, benefits are amplified when query-update 
independence can be checked statically. In order to be useful, ev- 
ery static analysis technique must be sound: if query-update inde- 
pendence is statically detected, then independence does hold. The 
inverse implication (completeness) cannot be ensured in the general 
case, since static independence detection is undecidable (see [6]). 
This means that if a static analyzer is used, for instance, in a view 
maintenance system, sometimes views are re-materialized after up- 
dates even if not needed, because the analysis has not been smart 
enough to statically detect a view-update independence. Useless 
view re-materialization frequently occurs if a static analyzer with 
low precision is adopted. This can lead to great waste of time, since 
view materialization cost can be proportional to the database size. 

High precision of static independence analysis can be ensured
by taking into account schema information. In many contexts,
schemas are defined by users, mainly by means of the DTD or XML
Schema languages, while in other contexts quite precise schemas,
in the form of a DTD, can be automatically inferred by using
accurate and efficient existing techniques like the one proposed by Bex
et al. in [8].

State of the Art 

Schema-based detection of XML query-update independence has 
been recently investigated. The state of the art technique has been 
presented by Benedikt and Cheney in [6]. This technique infers 
from the schema the set of node types traversed by the query, and 
the set of node types impacted by the update. The query and the 
update are then deemed as independent if the two sets do not over- 
lap. This technique is effective since the static analysis i) is able 
to manage a wide class of XQuery queries and updates, ii) can be 
performed in a negligible time, and iii) as a consequence, even on 
small documents, can avoid expensive query re-computation when 
independence wrt an update is detected. However, the technique 
has some weaknesses. As illustrated in [6], in some cases, inde- 
pendence is not detected, due to some over-approximation made 
by the type inference rules. 

For example, this technique cannot detect independence between
the query q1=//a//c and the update u1=delete //b//c, when
the schema enforces that c descendants of b nodes are never
descendants of a nodes. This is because the type inference technique of [6]
infers the type c both for the query path and the update path,
without considering contextual information about the inferred types.
Since the query and update types overlap, independence is wrongly
excluded. Indeed, the technique is not precise enough when
ancestor or descendant axes are used in queries and updates.

The way XPath axes are typed is not the only source of low
precision of this technique. Consider documents typed by the well
known bibliographic DTD used in [1], the query q2=//title and
the update u2=for x in //book return insert <author/>
into x. The technique of [6] infers bib, book and title as types
traced by q2, and book as type impacted by u2. According to this
technique, the two expressions share the type book, hence
independence is not detected, although it holds.

In neither of the above examples can independence be detected
by techniques ignoring schema information, like the path-based
approach proposed by Ghelli et al.^1 [15] and the recent
destabilizers-based approach proposed by Benedikt and Cheney [5].
Following these approaches, for the example q1-u1, the paths //a//c and
//b//c are deemed as overlapping since, for instance, documents
matching the path /a/b/c match both paths; similarly for the
example q2-u2 and the paths //title and //book.

1 This technique deals with update-commutativity detection for a
language with side effects and can be directly extended to
query-update independence detection without side effects.
Contributions

This paper proposes a novel schema-based approach for detecting
XML query-update independence. Differently from [6, 10, 11],
our system infers sequences of labels (hereafter called chains).
Intuitively, for each node that can be selected by a query/update path
in a schema instance, the system infers a chain recording i) all
labels that are encountered from the root to the node, ii) in the order
of traversal. This information is at the basis of a precise static
independence analysis. For instance, for q1=//a//c and u1=delete //b//c
over the schema { doc ← (a|b)*, a ← c, b ← c }, the chains doc.a.c
and doc.b.c are inferred for the query and the update, respectively.
Disjointness of these two chains allows us to statically derive the
independence for q1-u1. For the DTD of the XQuery Use Cases [1]
discussed before, the chains bib.book.title and bib.book.author,
respectively inferred for q2 and u2, diverge after the book
symbol; this allows us to conclude independence for q2-u2. These two
examples highlight that chain inference provides a more precise
independence analysis than that of [6, 15, 5].

The main contribution of this work is a precise algorithm to
detect independence for a query-update pair q-u knowing that
documents are valid wrt a DTD d. It strongly relies on the following
developments.

• Chain-based independence for q-u, a static notion, is the
foundation of our algorithm: starting from the set $C_d$ of all
possible chains associated with the DTD d, our inference system
extracts subsets of chains $C_q$ and $C_u$ which soundly
approximate the navigation through valid documents made by the
evaluation of the query q and the update u, respectively. Note
that our inference system (Section 3) is carefully specified
to deal with all XPath axes. Chain-based independence
is the result of the absence of overlapping pairs of chains in $C_q$
and $C_u$. Chain-based independence is proved to be sound wrt
the semantic notion of query-update independence (Section
4).

• A major step of our work concerns recursive schemas, for
which chain-based independence analysis may require
dealing with an infinite number of chains. Our
technique enabling the restriction of the analysis to finite subsets
of $C_q$ and $C_u$ is a key contribution, and the core of our
algorithm is the resulting finite analysis (Section 5). It is proved
to be equivalent to the infinite analysis^2.

• The algorithm has been carefully implemented, and exten- 
sive tests have been performed to validate our claim of preci- 
sion and efficiency (Section 6). Indeed, using a DAG-based 
representation of inferred chains allows the finite analysis to 
run in polynomial space and time. Concerning precision, our 
results show that our technique outperforms [6] to a large ex- 
tent. Test results also show that high savings of time can be 
ensured by avoiding re-evaluation of queries deemed as in- 
dependent of an update, even on relatively small documents. 

A nice property of our technique (Section 7) is that it can be 
easily extended in order to cope with Extended DTDs [14], and 
thus XML Schema. Discussions about related and future work are 
provided in Sections 8 and 9. 



2 This is reminiscent of the well known Finite Model Property
technique used in the context of finite model theory [17].



<doc>
  <a><c/></a>
  <a><c/></a>
  <b><c/></b>
  <a><c/></a>
</doc>

a document t valid wrt d

$s_d = doc$    $d(doc) = (a \mid b)^*$
$d(a) = c$     $d(b) = c$

a DTD d

$l_t \leftarrow doc[l_1, l_2, l_3, l_4]$
$l_1 \leftarrow a[l'_1]$    $l_2 \leftarrow a[l'_2]$
$l_3 \leftarrow b[l'_3]$    $l_4 \leftarrow a[l'_4]$
$l'_i \leftarrow c[\,]$, $i = 1..4$

the store $(\sigma, l_t)$

Figure 1: Document and store

2. PRELIMINARIES

Data model. We represent an instance of the XML data model as
a store $\sigma$, which is an environment associating each node location
(or identifier) $l$ with either an element node $a[L]$ or a text node $s$.
In $a[L]$, $a$ is the element tag, while $L=(l_1, \ldots, l_n)$ is the ordered
list of children locations in $\sigma$. A tree is a pair $t=(\sigma, l_t)$, where $l_t$ is
its root location. $dom(\sigma)$ denotes the set of locations of $\sigma$, while
$\sigma@l$ denotes the subtree of $\sigma$ rooted at $l$ whose domain is limited
to locations connected to $l$. See Figure 1 for a small document
together with its store.

DTDs. A DTD is a 3-tuple $(\Sigma, s_d, d)$ where: $\Sigma$ is a finite alphabet
for element tags, denoted by a, b, c; $s_d \in \Sigma$ is the start symbol; $d$ is
a function from $\Sigma$ to regular expressions over $\Sigma \cup \{S\}$, where $S$
denotes the string type. For simplicity, next we often use only the
$d$ component to specify a DTD.

A tree $t=(\sigma, l_t)$ is valid wrt $d$, denoted $t \in d$, iff there exists a
mapping $\nu : dom(t) \rightarrow \Sigma \cup \{S\}$ such that: $\nu(l_t)=s_d$; $\nu(l)=S$
implies that $\sigma(l)$ is a text node; $\nu(l)=a$ implies that $\sigma(l)=a[L]$
and the word $\nu(L)$ is generated by the regular expression $d(a)$.

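Definition 2.1 (Reachability and Chains). Let $d$ be a
DTD. $\alpha \Rightarrow_d \beta$ holds iff $\alpha, \beta \in \Sigma \cup \{S\}$ and $\beta$ occurs in the
regular expression $d(\alpha)$. A chain $c$ over $d$ is a sequence of labels
$\alpha_1.\alpha_2.\cdots.\alpha_n$ such that $\alpha_i \Rightarrow_d \alpha_{i+1}$ for $i=1 \ldots n-1$.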

The set of chains associated with the DTD $d$ is denoted $C_d$.

For the DTD of Figure 1, the set $C_d$ includes the chains doc.a,
a.c, doc.a.c, doc.b, b.c, and doc.b.c, because we have $doc \Rightarrow_d a$,
$a \Rightarrow_d c$, $doc \Rightarrow_d b$ and $b \Rightarrow_d c$.
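
The enumeration of chains from the reachability relation can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dictionary `REACH` (the relation $\Rightarrow_d$ for the DTD of Figure 1) and the function `chains` are hypothetical names, and the length bound `max_len` is an assumption added so that the enumeration also terminates on recursive DTDs.

```python
# Reachability relation of the Figure 1 DTD: each tag is mapped to the
# labels occurring in its content model (hypothetical encoding).
REACH = {"doc": {"a", "b"}, "a": {"c"}, "b": {"c"}, "c": set()}


def chains(reach, max_len):
    """Return every chain of length 1..max_len as a dot-separated string.

    A chain may start with any DTD symbol; consecutive labels must be
    related by the reachability relation.
    """
    result = set()
    frontier = [(label,) for label in reach]
    while frontier:
        chain = frontier.pop()
        result.add(".".join(chain))
        if len(chain) < max_len:
            # Extend the chain with every label reachable from its last one.
            for nxt in reach.get(chain[-1], ()):
                frontier.append(chain + (nxt,))
    return result


if __name__ == "__main__":
    for c in sorted(chains(REACH, 3)):
        print(c)
```

For a non-recursive DTD such as this one, any bound at least equal to the depth of the schema yields the whole set of chains; for a recursive DTD the bound only gives a finite prefix of an infinite set.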

Given two chains $c_1$ and $c_2$, the concatenation of $c_1$ and $c_2$ is
denoted $c_1.c_2$; we write $c_1 \preceq c_2$ to indicate that $c_1$ is a prefix of
$c_2$, that is $c_2=c_1.c'$ for some chain $c'$.

Observe that chains in $C_d$ are of finite length and may start with
any DTD symbol. The set $C_d$ is infinite only if $d$ is a
vertical-recursive schema.

Definition 2.2 (Node Type and Chain). Given $\sigma$ and
$l \in dom(\sigma)$, we define $typ(l)=a$ if $\sigma(l)=a[L]$, otherwise $typ(l)=S$.
The chain associated to the node $l$ is defined by $c^\sigma_l = typ(l)$ if $l$ has
no parent, otherwise $c^\sigma_l = c^\sigma_{parent(l)}.typ(l)$.



Consider the DTD and store of Figure 1; we have:
$c^\sigma_{l_1} = c^\sigma_{l_2} = c^\sigma_{l_4} = doc.a$ and
$c^\sigma_{l'_1} = c^\sigma_{l'_2} = c^\sigma_{l'_4} = doc.a.c$.
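
As an illustration of Definition 2.2, the chain of every node can be computed by one traversal of the store. The sketch below is only an assumed encoding: `STORE`, the location names and `node_chains` are hypothetical, element nodes are `(tag, children)` pairs, and text nodes (absent from Figure 1) would be plain strings typed by `S`.

```python
# Store of Figure 1: location -> (tag, ordered list of child locations).
STORE = {
    "lt": ("doc", ["l1", "l2", "l3", "l4"]),
    "l1": ("a", ["l1'"]),
    "l2": ("a", ["l2'"]),
    "l3": ("b", ["l3'"]),
    "l4": ("a", ["l4'"]),
    "l1'": ("c", []),
    "l2'": ("c", []),
    "l3'": ("c", []),
    "l4'": ("c", []),
}


def node_chains(store, root):
    """Map each location reachable from root to its chain (dotted string)."""
    result = {}
    stack = [(root, "")]
    while stack:
        loc, parent_chain = stack.pop()
        node = store[loc]
        # typ(l) is the tag for element nodes and S for text nodes.
        label = "S" if isinstance(node, str) else node[0]
        chain = label if not parent_chain else parent_chain + "." + label
        result[loc] = chain
        if not isinstance(node, str):
            stack.extend((child, chain) for child in node[1])
    return result


if __name__ == "__main__":
    for loc, chain in sorted(node_chains(STORE, "lt").items()):
        print(loc, chain)
```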



Proposition 2.3. Given a tree $t=(\sigma, l_t) \in d$, for each location
$l \in dom(\sigma)$, we have $c^\sigma_l \in C_d$.

Queries, Updates and Independence.

We assume that the reader is familiar with the XQuery and XQuery 
Update Facility languages. In this paper we consider the two large 
fragments considered in related approaches [6, 5], respectively de- 
fined by the following grammars. 

q    ::= () | q,q | <a>q</a> | s | x/step
       | for x in q return q | let x := q return q
       | if q then q else q

step ::= axis::φ
φ    ::= a | text() | node()

axis ::= self | child | descendant
       | descendant-or-self | parent
       | ancestor | ancestor-or-self
       | preceding-sibling | following-sibling

The empty-sequence and sequence queries are denoted by () and
q, q respectively. The query s denotes a constant string value. The
symbol φ is used for XPath node tests; a stands for a tag symbol.
XPath expressions x/step_1/.../step_n, although used in
examples, are not directly supported by the grammar; they can be
encoded in the standard way, by means of iteration and the allowed
single step expression; axes that are not included can be easily
encoded too.^3 The rest of the grammar is self-explanatory.

In examples, /φ and //φ are respectively used as shortcuts for
/child::φ and /descendant-or-self::node()/child::φ.

Also, to simplify the formal treatment, we assume that element 
construction <a>q</a> is not used in the left-hand side expres- 
sion of a for/let-expression. This restriction is met by a very large 
class of queries used in practice, while queries like let x := 
<a>q </a> return <b>x</b> can be rewritten by simple vari- 
able substitution. 

The subset of XQuery Update Facility we consider is defined as 
follows. All update operations (namely: insert, delete, rename and 
replace) are included. 

u   ::= () | u,u | for x in q return u
      | let x := q return u
      | if q then u1 else u2
      | delete q0 | rename q0 as a
      | insert q pos q0 | replace q0 with q

pos ::= before | after | into (as first | as last)?

Like for queries, updates can be composed sequentially or by
means of let/for statements, where only the return part can contain
update operations. In the other update expressions, q0 is the target
expression producing the (target) node in the input document, that is,
where the update has to be done. In insert and replace updates, q
is the source expression producing elements for the insertion or
replacement. Deletion delete q0 and renaming rename q0 as a
are self-explanatory. According to the W3C semantics [19], the
target expression q0 is required to output a single node, otherwise a
run-time error occurs.

Query and update semantics are specified in [12, 19], while a
succinct and elegant formalization can be found in [4], from which
we borrow some notions that are needed for our own presentation.
Query semantics is denoted by

$\sigma, \gamma \models q \Rightarrow \sigma_q, L_q$

meaning that the execution of the query q over $\sigma$ outputs a sequence
of locations $L_q$, roots of the answer trees for q, and a new store $\sigma_q$,
including $\sigma$ plus new elements built by q; the environment $\gamma$ binds
each free variable of q to a sequence of locations in $\sigma$.

3 e.g., /following::a becomes /ancestor-or-self::node()/
following-sibling::node()/descendant-or-self::a.

According to the W3C specification, update evaluation is split
into three phases: i) creation of an update pending list (UPL) of
simple update commands, ii) execution of sanity checks on this list,
and iii) application of the UPL on the input store so as to obtain the
updated data. Update commands $\iota$ in a UPL $\omega$ are of the form:

$\iota ::= ins(L, pos, l) \mid del(l) \mid repl(l, L) \mid ren(l, a)$

where $l$ is the target location and $L$ the sequence of roots of source
elements to be inserted. The creation of the UPL (phase i) is
denoted by:

$\sigma, \gamma \models u \Rightarrow \sigma_\omega, \omega$

As usual, $\gamma$ binds the free variables of u to locations in $\sigma$, and the store
$\sigma_\omega$ contains newly created locations potentially used in the UPL
$\omega$. Applying the UPL $\omega$ to the input store $\sigma$ (phase iii) produces
the updated store. This is denoted by:

$\sigma_\omega \vdash \omega \leadsto \sigma_u$

The composition of phases i) and iii) is denoted by:

$\sigma, \gamma \models u : \sigma_u$

Above, $dom(\sigma) \subseteq dom(\sigma_\omega) \subseteq dom(\sigma_u)$ holds. For a tree $t=(\sigma, l_t)$,
$u(t)$ denotes the tree $(\sigma_u@l_t, l_t)$, and $dom(\sigma) \subseteq dom(\sigma_u@l_t)$ may
not hold anymore^4. Given two stores $\sigma$ and $\sigma'$, two locations $l \in \sigma$
and $l' \in \sigma'$ are said to be value equivalent, written $(\sigma, l) \cong (\sigma', l')$, iff
the two trees $\sigma@l$ and $\sigma'@l'$ are isomorphic (they possibly differ
only in terms of locations). We write $(\sigma, L) \cong (\sigma', L')$ to
indicate value equivalence on location sequences $L=(l_1, \ldots, l_n)$ and
$L'=(l'_1, \ldots, l'_n)$, with $l_i \in \sigma$ and $l'_i \in \sigma'$, which holds iff
$(\sigma, l_i) \cong (\sigma', l'_i)$ for $i=1..n$.

Definition 2.4 (Independence). Let $\sigma$ be a store and $\gamma$
a variable environment. A query q and an update u are said to be
independent wrt $(\sigma, \gamma)$ if

$\sigma, \gamma \models q \Rightarrow \sigma_q, L_q \qquad \sigma, \gamma \models u : \sigma_u \qquad \sigma_u, \gamma \models q \Rightarrow \sigma'_q, L'_q$

implies $(\sigma_q, L_q) \cong (\sigma'_q, L'_q)$. Also, q and u are independent, written
$q \perp u$, iff they are independent for any pair $(\sigma, \gamma)$. Finally, q and
u are independent wrt the DTD d, written $q \perp_d u$, iff for every tree
$t=(\sigma, l_t) \in d$ and $\gamma$, they are independent for $(\sigma, \gamma)$.
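
To make Definition 2.4 concrete, the following sketch checks independence dynamically on one given document, by comparing the query result before and after the update. It is an assumption-laden illustration: trees are encoded as immutable `(tag, children)` tuples without locations, so plain equality plays the role of value equivalence, and `q1`, `delete_c_under` and `independent_on` are hypothetical names. Unlike the definition, it says nothing about the other documents valid wrt the DTD.

```python
def q1(tree, inside_a=False):
    """Evaluate //a//c: return every c element having an a ancestor."""
    tag, children = tree
    hits = [tree] if (tag == "c" and inside_a) else []
    for child in children:
        hits += q1(child, inside_a or tag == "a")
    return hits


def delete_c_under(ancestor_tag):
    """Build the update 'delete //X//c' for the given ancestor tag X."""
    def update(tree, inside=False):
        tag, children = tree
        inside_now = inside or tag == ancestor_tag
        kept = tuple(
            update(child, inside_now)
            for child in children
            if not (child[0] == "c" and inside_now)
        )
        return (tag, kept)
    return update


def independent_on(tree, query, update):
    """True iff the query result is value equivalent before and after."""
    return query(tree) == query(update(tree))


# The document of Figure 1.
C_LEAF = ("c", ())
DOC = ("doc", (("a", (C_LEAF,)), ("a", (C_LEAF,)),
               ("b", (C_LEAF,)), ("a", (C_LEAF,))))

if __name__ == "__main__":
    print(independent_on(DOC, q1, delete_c_under("b")))
    print(independent_on(DOC, q1, delete_c_under("a")))
```

On this document the first check succeeds and the second fails, since deleting the c elements below a nodes empties the result of //a//c.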

As a natural consequence of the fact that XML data are typed by 
a schema, we assume that our independence analysis is run in a 
context where all data remain consistent wrt the schema after each 
update. In case an update entails schema evolution, a larger
task of schema maintenance has to be carried out. This task may
require existing views (queries) to be reformulated in order to be
correct wrt the new schema, and thus it is likely to exclude any
other kind of schema-based analysis until its completion.

3. CHAIN INFERENCE 

In this section, we define deduction rules to statically infer chains 
for query and update expressions. Our system produces chains of 
different kinds. The classification resembles that of Marian and 
Simeon in [16] for query path extraction, and is needed due to the 
fact that different kinds of chains play different roles in the inde- 
pendence analysis. 



4 $\sigma_u@l_t$ discards locations disconnected from $l_t$ after the update.
A query chain belongs to one of the following three disjoint classes:

• Return chains type input document nodes (return nodes) that 
are roots of elements returned by the query. All descendants 
of a return node are in the query result, thus a return chain c 
implicitly embodies these descendants. Now, if a change of 
an update u targets a return node or some of its ancestors or 
descendants, query-update independence is not guaranteed. 

• Used chains type nodes (used nodes) belonging to the input
document and participating in the query evaluation, without
necessarily being part of the result itself. Clearly, if a change
of an update u targets a used node or some of its ancestors,
then query-update independence is not guaranteed.

• Element chains type newly constructed elements; an element 
chain is of the form a.c', where a is the tag of the constructed 
a element. Extracting these chains is important for the preci- 
sion of the independence analysis (see example below). 

For updates, we have one class of chains: 

• The purpose of an Update chain, denoted by c:c', is twofold:
c types nodes $l$ whose content may be changed by the update
and c' types descendants of $l$ (either introduced or removed
by the update) involved in the changes. For example, given
c:c', independence is not guaranteed if a query returns an
element whose root is typed by c.c", with c" a prefix of c'.

Let us now illustrate why element chains are necessary for a pre- 
cise independence analysis. Consider the following update over the 
well-known XQuery Use Cases DTD [1]: 

for x in //book return
    insert <author>q'</author> into x

Here the source expression is an element query, for which we
infer element chains of the form author.c', with c' a chain
inferred for q'. The update chain bib.book:author.c' is obtained by
concatenation of the chain bib.book associated with the target
expression x, and the chain author.c'. This allows one to conclude
independence wrt the query //title, whose unique return chain is
bib.book.title (since a title element is never a descendant of
an author element): the update chain is not a prefix of the query
chain and vice-versa.

Now, let us do the analysis without considering element chains:
for the source expression <author>q'</author>, the best that
can be done is to infer the chain bib.book:, telling that something
happens beneath book elements. As a consequence, we would not
deduce the independence.
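
The prefix comparison used in this example can be sketched as follows. This is only the informal test described above (neither chain is a prefix of the other); the precise chain-based independence criterion is the one of Section 4 and is not reproduced here. The function names are hypothetical, and flattening the update chain c:c' into c.c' is an assumption made for the comparison.

```python
def is_prefix(c1, c2):
    """True iff the label sequence c1 is a prefix of c2."""
    return c2[:len(c1)] == c1


def overlaps(query_chain, update_chain):
    """Informal overlap test between a query chain and an update chain.

    The update chain 'c:c2' is flattened to 'c.c2'; an empty suffix, as in
    'bib.book:', leaves only the prefix part.
    """
    q = tuple(query_chain.split("."))
    u = tuple(s for s in update_chain.replace(":", ".").split(".") if s)
    return is_prefix(q, u) or is_prefix(u, q)


if __name__ == "__main__":
    # With element chains: no overlap with the return chain of //title.
    print(overlaps("bib.book.title", "bib.book:author.first.S"))
    # Without element chains: bib.book: is a prefix of bib.book.title.
    print(overlaps("bib.book.title", "bib.book:"))
```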

In the presence of nested element construction, the same remark
holds. In the previous example, if q' is

<first>Umberto</first>, <second>Eco</second>

then by composing element chains during the inference, we end up
with the following two update chains: bib.book:author.first.S and
bib.book:author.second.S. Indeed, this is necessary to rule out
dependence wrt the query //author/email (assuming the DTD
allows for email elements in author elements).



3.1 Chain Inference for XPath Steps 

The definition of our chain inference system makes the
assumption that the inference is made starting from an input set of chains
$C$. This set can be either $C_d$ (infinite analysis) or a finite subset of
$C_d$ (finite analysis). We would like to stress that assuming a
precomputed chain set is only made to ease the formal presentation.
Any reasonable implementation can avoid this, by inferring chains
on the fly (see Section 6).

The first ingredient for query/update chain inference is chain in- 
ference for a single XPath step. We first define chain inference for 
axes, and then for node tests. 

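Axis chain inference aims at inferring all chains that can be
generated by axis navigation, in a d instance, starting from a node typed
by a chain $c \in C$. Chain inference rules strictly mimic XPath
semantics of axes, and are defined below (notice that $c'$ can be empty):

\[
\begin{array}{lcl}
A_C(c, \mathtt{self}) & \stackrel{\mathrm{def}}{=} & \{\, c \,\} \\
A_C(c, \mathtt{child}) & \stackrel{\mathrm{def}}{=} & \{\, c.\alpha \mid c.\alpha \in C \,\} \\
A_C(c, \mathtt{descendant}) & \stackrel{\mathrm{def}}{=} & \{\, c.c' \mid c.c' \in C,\ c' \neq \epsilon \,\} \\
A_C(c, \mathtt{descendant\text{-}or\text{-}self}) & \stackrel{\mathrm{def}}{=} & \{\, c.c' \mid c.c' \in C \,\} \\
A_C(c, \mathtt{parent}) & \stackrel{\mathrm{def}}{=} & \{\, c' \mid c = c'.\alpha \,\} \\
A_C(c, \mathtt{ancestor}) & \stackrel{\mathrm{def}}{=} & \{\, c' \mid c = c'.c'',\ c'' \neq \epsilon \,\} \\
A_C(c, \mathtt{ancestor\text{-}or\text{-}self}) & \stackrel{\mathrm{def}}{=} & \{\, c' \mid c = c'.c'' \,\}
\end{array}
\]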



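In the following rules, with a little abuse of notation, given a chain
$c$ on $d$, we use $d(c)$ to indicate either the regular expression $d(a)$, if
$c = c'.a$, or the empty regular expression $\epsilon$, if $c = c'.S$. Chain
inference for preceding/following-sibling axes is defined as follows.

\[
\begin{array}{lcl}
A_C(c, \mathtt{following\text{-}sibling}) & \stackrel{\mathrm{def}}{=} & \{\, c_1.\beta \in C \mid c = c_1.\alpha,\ \alpha <_{d(c_1)} \beta \,\} \\
A_C(c, \mathtt{preceding\text{-}sibling}) & \stackrel{\mathrm{def}}{=} & \{\, c_1.\alpha \in C \mid c = c_1.\beta,\ \alpha <_{d(c_1)} \beta \,\}
\end{array}
\]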

The relation $<_r$ is such that for all $\alpha, \beta \in \Sigma \cup \{S\}$, $\alpha <_r \beta$ holds
if there exists a word $w$ belonging to the language generated by
$r$ in which an $\alpha$ occurs before a $\beta$. This relation can be easily
defined by structural induction on $r$ (see [9]). For instance, we
have $<_{a,(b \mid c)^*} = \{(a,b), (a,c), (b,c), (c,b), (c,c), (b,b)\}$.
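
One possible structural induction for this relation is sketched below. It is an assumption, not the definition of [9]: regular expressions are encoded as nested tuples (`("sym", a)`, `("eps",)`, `("cat", r1, r2)`, `("alt", r1, r2)`, `("star", r)`, `("plus", r)`), and the hypothetical function `order` returns both the symbols occurring in $r$ and the pairs $(\alpha, \beta)$ such that $\alpha <_r \beta$.

```python
def order(regex):
    """Return (symbols, pairs) for a regular expression given as a tuple.

    pairs contains (x, y) iff some word of the language has an x occurring
    before a y. Every symbol of the expression occurs in some word, since
    the encoding has no empty-language constructor.
    """
    kind = regex[0]
    if kind == "sym":
        return {regex[1]}, set()
    if kind == "eps":
        return set(), set()
    if kind in ("cat", "alt"):
        syms1, pairs1 = order(regex[1])
        syms2, pairs2 = order(regex[2])
        pairs = pairs1 | pairs2
        if kind == "cat":
            # Any symbol of the left part can precede any of the right part.
            pairs |= {(x, y) for x in syms1 for y in syms2}
        return syms1 | syms2, pairs
    if kind in ("star", "plus"):
        syms, _ = order(regex[1])
        # Two iterations let any symbol precede any symbol, itself included.
        return syms, {(x, y) for x in syms for y in syms}
    raise ValueError("unknown regular expression node: %r" % (kind,))


# The expression a,(b|c)* used in the text.
EXAMPLE = ("cat", ("sym", "a"),
           ("star", ("alt", ("sym", "b"), ("sym", "c"))))

if __name__ == "__main__":
    print(sorted(order(EXAMPLE)[1]))
```

On the expression a,(b|c)* this yields exactly the six pairs listed above.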
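Rules for node-test chain inference are straightforward:

\[
\begin{array}{lcl}
T_C(c, \mathtt{node()}) & \stackrel{\mathrm{def}}{=} & \{\, c \,\} \\
T_C(c.\alpha, a) & \stackrel{\mathrm{def}}{=} & \{\, c.\alpha \mid \alpha = a \,\} \\
T_C(c.\alpha, \mathtt{text()}) & \stackrel{\mathrm{def}}{=} & \{\, c.\alpha \mid \alpha = S \,\}
\end{array}
\]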

Lemma 3.1 (Soundness of Step Chains). Let $t \in d$ be
a tree and $l_x \in dom(t)$. If $\sigma, (x := l_x) \models x/axis::\phi \Rightarrow \sigma, L$
then^5 for each $l \in L$ we have: $c^\sigma_l \in T_C(A_C(c^\sigma_{l_x}, axis), \phi)$.

The proof of soundness is reported in [9]. Step chain inference is
also minimal for any d; see [9] for further details.
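
Over a finite chain set, step chain inference for the vertical axes and the node tests can be sketched as below. The sketch assumes chains encoded as tuples of labels and a chain set `C` closed under prefixes; `axis_chains`, `node_test` and `C` are hypothetical names, sibling axes are left out, and the `self` case returns the chain itself, following XPath semantics.

```python
def axis_chains(C, c, axis):
    """Chains reachable from chain c by one axis navigation within C."""
    n = len(c)
    if axis == "self":
        return {c}
    if axis == "child":
        return {x for x in C if len(x) == n + 1 and x[:n] == c}
    if axis == "descendant":
        return {x for x in C if len(x) > n and x[:n] == c}
    if axis == "descendant-or-self":
        return {x for x in C if len(x) >= n and x[:n] == c}
    if axis == "parent":
        return {c[:-1]} if n > 1 else set()
    if axis == "ancestor":
        return {c[:k] for k in range(1, n)}
    if axis == "ancestor-or-self":
        return {c[:k] for k in range(1, n + 1)}
    raise ValueError("unsupported axis: " + axis)


def node_test(chains, test):
    """Keep the chains whose last label satisfies the node test."""
    if test == "node()":
        return set(chains)
    if test == "text()":
        return {c for c in chains if c[-1] == "S"}
    return {c for c in chains if c[-1] == test}


# Chains of the Figure 1 DTD that start at the root symbol.
C = {("doc",), ("doc", "a"), ("doc", "b"),
     ("doc", "a", "c"), ("doc", "b", "c")}

if __name__ == "__main__":
    print(node_test(axis_chains(C, ("doc",), "descendant"), "c"))
    print(axis_chains(C, ("doc", "a", "c"), "ancestor"))
```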

3.2 Chain Inference for Queries 

Inference rules for queries are presented in Table 1. As usual, a
variable environment $\Gamma$ associates each query free variable $x$ with a
set $\Gamma(x)$ of chains, typing nodes that can be assigned to the variable
during query evaluation.

Query rules prove judgements of the form:

$\Gamma \vdash_C q : (r; v; e)$

meaning that starting from $\Gamma$ and $C$, the chain inference produces
the sets $r$, $v$ and $e$, respectively containing the return, used and
element chains for q.

5 Notice here that step evaluation does not change $\sigma$.
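\[
\frac{\Gamma \vdash_C q_i : (r_i; v_i; e_i) \quad i=0..2}
{\Gamma \vdash_C \mathtt{if}\ q_0\ \mathtt{then}\ q_1\ \mathtt{else}\ q_2 : (r_1 \cup r_2;\ \bigcup_i v_i \cup r_0;\ e_1 \cup e_2)}
\quad (\textsc{If})
\]

\[
\frac{}{\Gamma \vdash_C () : (\emptyset; \emptyset; \emptyset)}
\quad (\textsc{Empty})
\qquad
\frac{}{\Gamma \vdash_C s : (\emptyset; \emptyset; \{S\})}
\quad (\textsc{Text})
\]

\[
\frac{\Gamma \vdash_C q_1 : (r_1; v_1; e_1) \qquad \Gamma[x \mapsto \{c\}] \vdash_C q_2 : (r_c; v_c; e_c)\ \ \text{for any } c \in r_1}
{\Gamma \vdash_C \mathtt{for}\ x\ \mathtt{in}\ q_1\ \mathtt{return}\ q_2 : \Big( \bigcup_{c \in r_1} r_c;\ \ v_1 \cup \bigcup_{c \in r_1,\ r_c \cup e_c \neq \emptyset} (v_c \cup \{c\});\ \ \bigcup_{c \in r_1} e_c \Big)}
\quad (\textsc{For})
\]

\[
\frac{\Gamma \vdash_C q_1 : (r_1; v_1; e_1) \qquad \Gamma[x \mapsto r_1] \vdash_C q_2 : (r_2; v_2; e_2)}
{\Gamma \vdash_C \mathtt{let}\ x := q_1\ \mathtt{return}\ q_2 : (r_2;\ r_1 \cup v_1 \cup v_2;\ e_2)}
\quad (\textsc{Let})
\]

\[
\frac{axis \notin \{\mathtt{self}, \mathtt{child}, \mathtt{descendant\text{-}or\text{-}self}\} \qquad r_c = T_C(A_C(c, axis), \phi)\ \ \text{for any } c \in \Gamma(x)}
{\Gamma \vdash_C x/axis::\phi : \Big( \bigcup_{c \in \Gamma(x)} r_c;\ \ \bigcup_{c \in \Gamma(x),\ r_c \neq \emptyset} \{c\};\ \ \emptyset \Big)}
\quad (\textsc{StepUH})
\]

\[
\frac{axis \in \{\mathtt{self}, \mathtt{child}, \mathtt{descendant\text{-}or\text{-}self}\} \qquad r_c = T_C(A_C(c, axis), \phi)\ \ \text{for any } c \in \Gamma(x)}
{\Gamma \vdash_C x/axis::\phi : \Big( \bigcup_{c \in \Gamma(x)} r_c;\ \ \emptyset;\ \ \emptyset \Big)}
\quad (\textsc{StepF})
\]

\[
\frac{\Gamma \vdash_C q_i : (r_i; v_i; e_i) \quad i=1..2}
{\Gamma \vdash_C q_1, q_2 : (r_1 \cup r_2;\ v_1 \cup v_2;\ e_1 \cup e_2)}
\quad (\textsc{Conc})
\]

\[
\frac{\Gamma \vdash_C q : (r; v; e) \qquad e_0 = \{\, a.\alpha.c' \mid c.\alpha \in r,\ c.\alpha.c' \in \bar{r} \,\} \cup \{\, a.c \mid c \in e \,\} \cup \{\, a \mid r \cup e = \emptyset \,\}}
{\Gamma \vdash_C \mathtt{<a>}q\mathtt{</a>} : (\emptyset;\ \bar{r} \cup v;\ e_0)}
\quad (\textsc{Elt})
\]

Table 1: Chain Inference Rules for Queries
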

In the rules, $\bar{r}$ denotes all descendant chains of chains in the set $r$
wrt $C$:

$\bar{r} = \{\, c.c' \mid c \in r,\ c.c' \in C \,\}$

All the rules mimic query semantics [12, 4]. We only comment
on the main ones. Rules (For) and (Let) are very similar, thus we
only comment on the (For) rule. It performs an iteration on the
set of return chains inferred for q1. Return chains for q1 are then
converted into used chains. This is needed because chain inference
is a bottom-up process: inside q1 a path expression is seen as a
query producing a result (and as such it locally produces return
chains), while it only selects nodes to be used in the outer iteration
for x in q1 return q2.

In the (For) rule, irrelevant chains are filtered out. To illustrate,
consider the query

for x in //node() return if x/b then x/a

The chain inference, thanks to chain filtering, only produces used
chains that lead to either an a or a b node. Otherwise, the set of
all possible chains generated by the subquery //node() would be
inferred as used chains for the whole query; as a dramatic
consequence, the query would be considered as dependent wrt almost
every update.

The rule (StepF) produces return chains, those pointing to nodes
returned by the forward XPath step. The rule (StepUH) is similar,
and deals with upward and horizontal axes. It also produces used
chains, by filtering only those bound to the step variable and
leading to new result chains according to the step navigation. This is
needed since return chains produced by a horizontal/upward step
may not contain as a prefix the used chain in $\Gamma(x)$ from which they
have been generated. E.g., for the DTD d = { a ← (b+, c*) }, and
the query /a/b/following-sibling::c, we infer a.b as a used
chain, and a.c as a return chain.

Element queries <a>q</a> are dealt with by rule (Elt). This
rule infers element chains of the form a.c, where c is obtained from
either an element or return chain of q. The rule also infers used
chains by collecting: i) used chains of q, and ii) used chains
obtained from return chains of q. To this end $\bar{r}$ is used to extend
returned chains of the inner query. This return-to-used chain
conversion is needed to correctly handle nested element construction.
For instance, consider the following query q = <r1>q'</r1>,
where

q' = (x/a, <r2>x/b</r2>)

Element chains for q are inferred in terms of chains for q'. So,
element chains for q are r1.a and r1.r2.b, assuming that for q' the
return chains are c.a (for x/a) and r2.b (for <r2>x/b</r2>).
In order to avoid ending up with a wrong element chain r1.b for
q, the return chain c.b for x/b does not have to be considered as
a return chain for q' as well. This is handled by the
return-to-used conversion of the return chain c.b when inferring chains for
<r2>x/b</r2> (and hence for q'). It is worth stressing that if
we just convert return chains to used ones without the extension
$\bar{r}$, then we lose their semantic property of representing entire
subtrees of data. Notice that this extension is needed for the purpose of
the formal presentation, although any efficient implementation can
avoid performing these extensions by using intensional
representations.
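
The behaviour of rule (Elt) on this nested example can be sketched as follows. The code is a hedged reading of the rule as described in the text, not the paper's implementation: chains are tuples of labels, `C` is an assumed finite chain set (the root-anchored chains of the Figure 1 DTD), x is assumed to be bound to the chain doc, and `extend` and `elt` are hypothetical names.

```python
def extend(r, C):
    """All chains of C having some chain of r as a prefix (the extension of r)."""
    return {x for x in C if any(x[:len(c)] == c for c in r)}


def elt(tag, r, v, e, C):
    """Sketch of (Elt): chains for <tag>q</tag> from the (r; v; e) of q.

    Returns (return chains, used chains, element chains).
    """
    new_e = set()
    for c in r:
        for x in extend({c}, C):
            # Keep the last label of c and what follows it, below the new tag.
            new_e.add((tag,) + x[len(c) - 1:])
    new_e |= {(tag,) + x for x in e}
    if not r and not e:
        new_e.add((tag,))
    # Return chains of q become used chains, extended to whole subtrees.
    return set(), extend(r, C) | v, new_e


C = {("doc",), ("doc", "a"), ("doc", "b"),
     ("doc", "a", "c"), ("doc", "b", "c")}

# Inner query <r2>x/b</r2>, with x/b returning the chain doc.b.
INNER = elt("r2", {("doc", "b")}, set(), set(), C)
# q' = (x/a, <r2>x/b</r2>): componentwise union, as in rule (Conc).
Q_PRIME = ({("doc", "a")} | INNER[0], INNER[1], INNER[2])
# q = <r1>q'</r1>
OUTER = elt("r1", Q_PRIME[0], Q_PRIME[1], Q_PRIME[2], C)

if __name__ == "__main__":
    print(sorted(OUTER[2]))
```

Because doc.b was converted into a used chain by the inner application, no element chain r1.b is produced for q; the chains r1.a and r1.r2.b are obtained, together with their extensions r1.a.c and r1.r2.b.c that this DTD allows.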

The rule (Text) deals with expressions building new text nodes.
The rule infers S as an element chain^6.



3.3 Chain Inference for Updates 

As seen before, update chains are of the form c:c'. Essentially,
the prefix c types updated nodes, that are nodes whose children are
modified by the update, while the suffix c' types modified children
or new descendants. Update chains are inferred by rules in Table 2
(only main rules are reported; see [9] for the full set of rules). Chain
inference for insert-into expressions (position ranges over into, first
and last) is specified by the rule (Insert-1). For any chain c:c'
inferred, the prefix c is a return chain of the target query q0 (typing
the insertion point), while the suffix c' is either a return or element
chain of the source expression q typing a branch of a node
element returned by q itself; this element can either be a new one or
a sub-element of the input document; in both cases the suffix chain
corresponds to inserted data. Rule (Insert-2) is similar, and deals
with insert-before/after updates. Inference for delete expressions is
defined by the (Delete) rule, which simply puts the separator ':'
just before the last symbol of a return chain of the target query. A
delete chain c:a captures that a node typed by c has a child labeled
by a which may be deleted. Similarly, the (Rename) rule infers
chains c:a where a is the type of the target node before renaming,
but it also produces chains c:b typing renamed nodes. The rule for
replace expressions (Replace) is built on the same principles as
the (Insert-1) and (Delete) rules.

\[
\frac{\Gamma \vdash_C q : (r; v; e) \qquad \Gamma \vdash_C q_0 : (r_0; v_0; e_0) \qquad pos \in \{\mathtt{into}, \mathtt{first}, \mathtt{last}\}
\atop
U = \{\, c{:}c' \mid c \in r_0,\ c' \in e \,\} \cup \{\, c{:}\alpha.c'' \mid c \in r_0,\ c'.\alpha \in r,\ c'.\alpha.c'' \in C \,\}}
{\Gamma \vdash_C \mathtt{insert}\ q\ pos\ q_0 : U}
\quad (\textsc{Insert-1})
\]

\[
\frac{\Gamma \vdash_C q : (r; v; e) \qquad \Gamma \vdash_C q_0 : (r_0; v_0; e_0) \qquad pos \in \{\mathtt{after}, \mathtt{before}\}
\atop
U = \{\, c{:}c' \mid c.\alpha \in r_0,\ c' \in e \,\} \cup \{\, c{:}\beta.c'' \mid c.\alpha \in r_0,\ c'.\beta \in r,\ c'.\beta.c'' \in C \,\}}
{\Gamma \vdash_C \mathtt{insert}\ q\ pos\ q_0 : U}
\quad (\textsc{Insert-2})
\]

\[
\frac{\Gamma \vdash_C q_0 : (r_0; v_0; e_0) \qquad U = \{\, c{:}\alpha \mid c.\alpha \in r_0 \,\}}
{\Gamma \vdash_C \mathtt{delete}\ q_0 : U}
\quad (\textsc{Delete})
\]

\[
\frac{\Gamma \vdash_C q_0 : (r_0; v_0; e_0) \qquad U = \{\, c{:}\alpha \mid c.\alpha \in r_0 \,\} \cup \{\, c{:}b \mid c.\alpha \in r_0 \,\}}
{\Gamma \vdash_C \mathtt{rename}\ q_0\ \mathtt{as}\ b : U}
\quad (\textsc{Rename})
\]

\[
\frac{\Gamma \vdash_C q : (r; v; e) \qquad \Gamma \vdash_C q_0 : (r_0; v_0; e_0)
\atop
U = \{\, c{:}\alpha \mid c.\alpha \in r_0 \,\} \cup \{\, c{:}\beta.c'' \mid c.\alpha \in r_0,\ c'.\beta \in r,\ c'.\beta.c'' \in C \,\} \cup \{\, c{:}c' \mid c \in r_0,\ c' \in e \,\}}
{\Gamma \vdash_C \mathtt{replace}\ q_0\ \mathtt{with}\ q : U}
\quad (\textsc{Replace})
\]

Table 2: Chain Inference Rules for Updates

6 For simplicity, we preferred not to use a 5th class of chains.
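
The two simplest update rules, as described in the text, can be sketched as below. Chains are dotted strings, `r0` is the set of return chains assumed to have been inferred for the target query, and `delete_chains` and `rename_chains` are hypothetical names. The sketch further assumes that every target chain has at least two labels, i.e. that the root element is never the target.

```python
def split_last(chain):
    """Split 'c.a' into the prefix 'c' and the last label 'a'."""
    prefix, _, last = chain.rpartition(".")
    return prefix, last


def delete_chains(r0):
    """(Delete) sketch: put ':' just before the last label of each chain."""
    return {"%s:%s" % split_last(c) for c in r0}


def rename_chains(r0, new_tag):
    """(Rename) sketch: one chain for the old label, one for the new tag."""
    result = set()
    for c in r0:
        prefix, last = split_last(c)
        result.add("%s:%s" % (prefix, last))
        result.add("%s:%s" % (prefix, new_tag))
    return result


if __name__ == "__main__":
    # delete //b//c over the DTD of Figure 1: the target chain is doc.b.c.
    print(delete_chains({"doc.b.c"}))
    print(sorted(rename_chains({"bib.book.title"}, "name")))
```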

3.4 Soundness of Chain Inference 

From now on, we consider given a DTD d, and some valid
document $t=(\sigma, l_t) \in d$. For a query q, we assume:

$\sigma, \gamma \models q \Rightarrow \sigma_q, L_q$ and $\Gamma \vdash_{C_d} q : (r; v; e)$

For an update u, we assume:

$\sigma, \gamma \models u \Rightarrow \sigma_\omega, \omega \qquad \sigma_\omega \vdash \omega \leadsto \sigma_u$ and $\Gamma \vdash_{C_d} u : U$.

Recall that $u(t)$ denotes the tree $(\sigma_u@l_t, l_t)$. Also, for the sake
of simplicity, queries and updates are assumed to be quasi-closed:
they contain only one free variable x initially bound to the root of
the input XML tree (see [9] for the general case). It means that
$\gamma=\{x \mapsto l_t\}$ for query and update evaluation, and $\Gamma=\{x \mapsto s_d\}$
for static chain inference.

Soundness of query chain inference. Proving soundness of 
query chain rules consists of proving that, for any schema instance, 
any node used or built by the query q is captured (typed) by the 
chains inferred for q. The proof relies on the notion of XML pro- 
jection [16, 7]. 

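A tree $t'$ is a projection of $t$, denoted $t' \preceq t$, if $t'$ is obtained
from $t$ by discarding some subtrees. A projection of a tree $t$ can
be obtained from a set $\mathcal{L} \subseteq dom(\sigma)$, where $\mathcal{L}$ is non-empty and
upward closed with respect to the $\sigma$ parent-child relationship^7. For
a sequence of locations $L$, we define $L|_{\mathcal{L}}$ as the subsequence of $L$
containing only $\mathcal{L}$ identifiers, and preserving $L$ ordering. Then a
projection of $t$ wrt a set $\mathcal{L}$ is defined as $t|_{\mathcal{L}}=(\sigma|_{\mathcal{L}}, l_t)$ where

$\sigma|_{\mathcal{L}} = \{\, l \leftarrow a[L|_{\mathcal{L}}] \mid l \in \mathcal{L},\ (l \leftarrow a[L]) \in \sigma \,\} \cup \{\, l \leftarrow s \mid l \in \mathcal{L},\ (l \leftarrow s) \in \sigma \,\}$

We say that $t|_{\mathcal{L}}$ is a q-projection of $t$ if, assuming that $\sigma|_{\mathcal{L}}, \gamma \models
q \Rightarrow \sigma', L'$, we can conclude $(\sigma_q, L_q) \cong (\sigma', L')$. Given a set of
chains $r$, the set $\mathcal{L}^t_r$ of locations in $t=(\sigma, l_t)$ typed by chains in $r$
is defined as:

$\mathcal{L}^t_r \stackrel{\mathrm{def}}{=} \{\, l \mid l \in dom(\sigma),\ c^\sigma_l.c \in r \,\}$

7 $\forall l\ \big( l \in \mathcal{L} \wedge (l' \leftarrow a[L]) \in \sigma \wedge l \in L \big) \Rightarrow l' \in \mathcal{L}$.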

Finally, t\c is a minimal q-projection of t if none of the strict pro- 
jections of t\c is a q-projection. Note that, t' is a q-projection of t 
provided that t\c -< t', for t\c a minimal q-projection. A minimal 
projection is not unique, due to the query language considered. 
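
As a rough illustration of the projection $t|_C$ defined above, the following Python sketch models a store as a dictionary mapping each location either to a text string or to a (label, children) pair. This encoding and the names `project` and `is_upward_closed` are assumptions made for the example; they are not taken from the paper's implementation.

```python
def is_upward_closed(store, keep):
    """Check that `keep` is non-empty and contains the parent of each kept location."""
    parent = {
        child: loc
        for loc, node in store.items()
        if not isinstance(node, str)      # text nodes have no children
        for child in node[1]
    }
    return bool(keep) and all(parent[l] in keep for l in keep if l in parent)


def project(store, keep):
    """Return the store restricted to `keep`, pruning discarded children."""
    projected = {}
    for loc, node in store.items():
        if loc not in keep:
            continue
        if isinstance(node, str):
            projected[loc] = node                      # l <- s
        else:
            label, children = node                     # l <- a[L]
            projected[loc] = (label, [c for c in children if c in keep])
    return projected


# Hypothetical store: r[ a["text"], b[] ]
store = {1: ("r", [2, 3]), 2: ("a", [4]), 3: ("b", []), 4: "text"}
print(project(store, {1, 2}))
```

Child order is preserved because the list comprehension filters the original child sequence, mirroring the requirement that $L|_C$ preserves L ordering.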

The following theorem formally states that chains inferred for a 
query q cover the structure of data relevant for the query, and newly 
constructed elements. 

Theorem 3.2. (Soundness of Query Chains) 

1. If t' is a minimal q-projection of t, then $t' \preceq t|_{L_t^{r \cup v}}$. 

2. If t' is the subtree of $\sigma_q$ rooted at $l' \in L_q \setminus dom(\sigma)$, then $t' \preceq t'|_{L_{t'}^{e}}$. 

The first item of Theorem 3.2 states that chain inference is sound for used and return chains: a projection of any valid input tree made in terms of used and return chains includes every minimal q-projection, hence preserves query semantics (the projection contains all the query needs for its evaluation). The second item is dedicated to element chain inference, which is one of the key features of our query-update analysis, as already illustrated. Intuitively, this statement says that if element chains are used to project newly constructed elements (notice that $l' \in L_q \setminus dom(\sigma)$) no node is pruned out, so element chains cover all possible chains in new elements of the query result. 

Soundness of update chain inference. Proving update chain 
soundness consists in establishing a link between i) nodes in the 
stores t and u(t) that are involved in the changes (deletion, inser- 
tion, renaming and replacements) made by u and ii) nodes in these 
trees which are captured (typed) by the chains statically inferred for 
u. 

Definition 3.3 (Involved Location). We say that the update u involves the location $l \in dom(\sigma_w)$ if l is either the target location of an elementary delete, rename or replace command in w, or a critical location or a descendant of a critical location, where a critical location is a location in the source list L of a command ins(L, _, _) or repl(_, L) in w. 

Note that an involved location may belong to the initial tree t but not to the updated tree u(t) and conversely. It may also, of course, belong to both trees. The theorem below states that all locations involved by the update u are typed by chains inferred from u. 

Theorem 3.4 (Soundness of Update Chains). If l is a location in t, i.e. $l \in dom(\sigma)$ (respectively a location in u(t), i.e. $l \in dom(\sigma_u @ l_t)$) and the update u involves l then there exists $c{:}c' \in U$ such that $c_l^{\sigma} = c{:}c'$ (respectively $c_l^{\sigma_u} = c{:}c'$) where $c' \neq \epsilon$. 

In the above statement, in case a location l belongs both to t and u(t), it may be that the chain typing l in t is different from the chain typing l in u(t) (e.g., due to renaming). 

Although we made the assumption that the update expression 
is preserving the schema, it is worth noticing that Theorem 3.4 
holds also for updates violating schema constraints ($u(t) \notin d$), since 
chains corresponding to deleted or inserted nodes are always traced 
by the system regardless of correctness wrt the schema. 

4. INFINITE ANALYSIS 

The notion of query-update independence $q \perp\!\!\!\perp_d u$ (Definition 2.4) is based on the semantics of q and u, and involves all possible d instances. The static counterpart of this notion is now proposed and is of course based on query and update chain inference. As chain inference depends on a set C of chains, we first introduce a general static notion of C-independence. 

Given two sets of chains $\tau_1$ and $\tau_2$, the set of conflicting pairs of chains for $\tau_1$ and $\tau_2$ is defined by: 

$confl(\tau_1, \tau_2) = \{\, (c_1, c_2) \mid c_1 \in \tau_1,\ c_2 \in \tau_2,\ c_1 \preceq c_2 \,\}$ 

Definition 4.1 (C-independence). A query q and an update u are C-independent, denoted by $q \perp_C u$, if, provided that $\Gamma \vdash_C q : (r; v; e)$ and $\Gamma \vdash_C u : U$, we have: 

$confl(r, U) = confl(U, r) = confl(U, v) = \emptyset$. 
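
The conflict test and the C-independence condition can be sketched in a few lines of Python. The sketch assumes that chains are tuples of tag names and that the ':' separator of update chains has been dropped before comparison; the function names are hypothetical.

```python
def is_prefix(c1, c2):
    """True when chain c1 is a prefix of chain c2."""
    return len(c1) <= len(c2) and c2[:len(c1)] == c1


def confl(t1, t2):
    """Pairs (c1, c2) with c1 in t1, c2 in t2 and c1 a prefix of c2."""
    return {(c1, c2) for c1 in t1 for c2 in t2 if is_prefix(c1, c2)}


def c_independent(r, v, U):
    """Definition 4.1: all three conflict sets must be empty.

    r: return chains, v: used chains, U: update chains (separator removed).
    """
    return not (confl(r, U) or confl(U, r) or confl(U, v))


# Return chain r.a.b against two candidate update chains.
r = {("r", "a", "b")}
print(c_independent(r, set(), {("r", "a", "c")}))
print(c_independent(r, set(), {("r", "a", "b", "f", "a", "c")}))
```

Note the asymmetry of the definition: used chains v are only checked as possible extensions of update chains, whereas return chains r are checked in both directions.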

The main result of this section states that, when C is taken as the set $C_d$ of chains generated for the DTD d, C-independence implies $q \perp\!\!\!\perp_d u$ independence. 

Theorem 4.2 (Soundness of $C_d$-independence). 

$q \perp_{C_d} u$ implies $q \perp\!\!\!\perp_d u$ 

In order to prove Theorem 4.2, the following property is used; it is a consequence of soundness of chain inference (Theorems 3.2 and 3.4). Next, $X_U^t$ denotes the set of nodes in a tree t typed by update chains in U: 

$X_U^t \stackrel{def}{=} \{\, l \in dom(t) \mid c_l^{t} = c{:}c' \in U,\ c' \neq \epsilon \,\}$ 

Proposition 4.3. If $q \perp_{C_d} u$, then we have: 

xJnrLv = ^ ( "n^ = 

This proposition states that C d -independence implies that nodes 
typed by query chains are disjoint from nodes typed by update 
chains. The proof is reported in [9]. 

As already stated, updates are assumed to preserve the schema. The above theorem needs this assumption in order to correctly use query chains in the independence analysis. Actually, if deletions violate the schema (a mandatory node is deleted), $\perp_{C_d}$ is still sound. The problem comes from insertions creating new chains (not belonging to $C_d$) because they are not considered during chain inference for queries. As a consequence, the analysis made to check $\perp_{C_d}$ could miss conflicting chains. Extending our technique so as to capture schema evolution is left as future work. 



5. FINITE ANALYSIS 

The notion of $C_d$-independence (Definition 4.1) cannot be directly used to define a terminating decision algorithm, because for DTDs with vertical recursion the sets of inferred chains can be infinite. In this section we show how to finitely approximate sets of inferred chains so that $C_d$-independence can be detected in finite time. 

One feature of chains generated by a recursive DTD d is that some of them contain multiple occurrences of (recursively defined) tags. So one way to characterize a finite set of d-chains is to restrict to chains having at most k occurrences of each tag. Hereafter, these chains are called k-chains, and for any set of chains $\tau$, its subset of k-chains is denoted by $\tau^k$. Thus, $C_d^k$ denotes the set of k-chains generated by d. 
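
To make the finiteness of $C_d^k$ concrete, here is a minimal Python sketch that enumerates the rooted k-chains of a DTD. It assumes the DTD is abstracted as a map from each tag to the tags allowed as children (content-model order and cardinalities are ignored); the example map corresponds to the recursive DTD with productions r ← a, a ← (b,c,e)*, b,c,e ← f, f ← a,g used later in this section, and the function name is hypothetical.

```python
def k_chains(children, root, k):
    """Enumerate rooted chains in which every tag occurs at most k times (k >= 1)."""
    result = []

    def visit(chain, counts):
        result.append(tuple(chain))
        for child in children.get(chain[-1], ()):
            if counts.get(child, 0) < k:      # stop unfolding after k occurrences
                counts[child] = counts.get(child, 0) + 1
                chain.append(child)
                visit(chain, counts)
                chain.pop()
                counts[child] -= 1

    visit([root], {root: 1})
    return result


# Assumed child map for the recursive example DTD.
D1 = {"r": ["a"], "a": ["b", "c", "e"], "b": ["f"], "c": ["f"],
      "e": ["f"], "f": ["a", "g"], "g": []}

for k in (1, 2):
    print(k, len(k_chains(D1, "r", k)))
```

Termination follows from the occurrence bound: a chain over an alphabet of n tags can never be longer than k × n symbols.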

As illustrated next, a multiplicity value k can be inferred from the query q and update u, so that independence according to inferred chains in $C_d$ is equivalent to independence according to inferred chains in $C_d^k$. The value k is derived by a two-step static analysis. 

Given an expression exp, being either a query q or an update u, the first step associates a value $k_{exp}$ to exp such that the set of $k_{exp}$-chains inferred for exp is representative of all possible inferred chains for exp. Intuitively, the representative set of inferred chains for an expression synthesizes all possible inferred chains: any possible inferred chain can be mapped to a chain in the representative set by some folding transformations, according to recursive definitions in the DTD. The second step infers a value k from the values $k_q$ and $k_u$, such that the search for conflicting chains decisive for statically detecting q-u independence can be safely done in the finite set of inferred k-chains. 

Inferring the values $k_q$ and $k_u$ mainly depends on navigational properties of the XPath expressions occurring in the query and update. Thus, we start the discussion by focusing on XPath expressions, and then consider FLWR expressions. 

Dealing with child, self and parent. In this case, a good choice for $k_p$ is the maximal tag frequency in the path p. Consider the following recursive DTD $d_1$: 

$r \leftarrow a \qquad b,c,e \leftarrow f \qquad a \leftarrow (b,c,e)^* \qquad f \leftarrow a,g$ 

For the path p=/r/a/b/f/a the maximal tag frequency is 2, and indeed 2-chains include the representative chain r.a.b.f.a (the only chain inferred for this path); the same holds for the navigational path /r/a/b/f/a/parent::f (note here that the 2-chain r.a.b.f.a is a used chain). Similarly, for the path /r/a/b/f/* we choose $k_p=2$, since the wildcard * stands for any label. 

Dealing with descendant and ancestor. When a path p makes use of the descendant axis, the length of inferred chains is totally unrelated to the length of p (e.g., consider /descendant::b over $d_1$). This is what led us to reason in terms of tag frequency rather than path length. Furthermore, such a path can lead us to infer an infinite number of chains over a recursive DTD. To generate a finite set of representative chains, the value $k_p$ is determined by taking into account the number of occurrences of descendant axes in p. 

To illustrate, we still consider the schema $d_1$, and observe that the type a is defined in terms of b, c and e, and vice versa. In a valid document instance, a b node can be a descendant of a c node, and vice versa, along the same chain of the tree. In addition, a chain connecting b and c nodes always contains an intermediate a label, which also occurs before the first occurrence of a b, c or e label. As a consequence, for the following path p 

/descendant::b/descendant::c/descendant::e 

over the DTD $d_1$, the shortest chain that is inferred for the path p is r.a.b.f.a.c.f.a.e, a 3-chain. Simple tag frequency, as in the previous cases, would lead to $k_p=1$. This is not satisfactory because no chain is inferred for p starting from 1-chains. To reflect the fact that each recursive axis may permit any tag to repeat once in inferred chains, the correct maximal tag frequency we have to consider for the path p is 3; in fact, 3-chains do allow us to infer a non-empty set of representative chains. Of course, an XPath expression may combine both recursive and non-recursive axes. In this case, for a path p we obtain $k_p$ as the sum of two components computed independently: the maximal tag frequency for non-recursive steps, and the number of recursive steps in p. As an example, for p=/descendant::b/a/b, we have $k_p=2$ since the maximal tag frequency for the descendant-free part /a/b is 1, and there is 1 descendant step /descendant::b. 

Recursive backward axes are handled similarly. Here we have to pay attention to the fact that chains navigated by an ancestor step are prefixes of some chains generated by previous steps. Consider p=/descendant::b/ancestor::c. Here $k_p$ has to be such that the used chain r.a.c.f.a.b can be generated. Thus, we enforce ancestor steps to increment the tag frequency by 1. This is reminiscent of what we have seen before in the case of descendant; the way p is processed can be compared to the way the navigational path /descendant::c/descendant::b would be processed, as chains containing c ancestors of b need to be generated for p. 

Concerning paths p employing either descendant-or-self or ancestor-or-self, $k_p$ is computed as for the self-less axes. 

Dealing with sibling axes. Sibling axes are managed as child and parent axes. Let us consider the recursive schema { a ← (b, f*), b ← (b|c)*, f ← (e, g) } and the navigational path /descendant::c/following-sibling::b. For this path, the used 1-chain a.b.c and the return 2-chain a.b.b are the needed chains. The presence of the /descendant::c step entails k=2. 
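
The rules discussed so far for a single XPath path can be summarized by the following Python sketch. It assumes a path is given as a list of (axis, node test) pairs and that the wildcard * and node() count towards the frequency of every tag; `k_for_path` and `child_path` are hypothetical helper names.

```python
from collections import Counter

RECURSIVE_AXES = {"descendant", "descendant-or-self",
                  "ancestor", "ancestor-or-self"}
ANY_LABEL = {"*", "node()"}


def k_for_path(steps):
    """Maximal tag frequency over non-recursive steps plus number of recursive steps."""
    recursive_steps = sum(1 for axis, _ in steps if axis in RECURSIVE_AXES)
    plain_tests = [test for axis, test in steps if axis not in RECURSIVE_AXES]
    wildcards = sum(1 for test in plain_tests if test in ANY_LABEL)
    counts = Counter(test for test in plain_tests if test not in ANY_LABEL)
    # A wildcard may stand for any label, so it adds to every tag's frequency.
    max_frequency = max(counts.values(), default=0) + wildcards
    return max_frequency + recursive_steps


def child_path(*tags):
    """Build a path made only of child steps, e.g. child_path('r', 'a') for /r/a."""
    return [("child", tag) for tag in tags]


print(k_for_path(child_path("r", "a", "b", "f", "a")))
print(k_for_path([("descendant", "b"), ("child", "a"), ("child", "b")]))
```

Sibling, self and parent steps fall into the non-recursive branch, in line with the statement that sibling axes are managed as child and parent axes.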

Dealing with FLWR expressions. Based on concepts pre- 
viously illustrated, we provide now formal definitions to deal with 
the general case of FLWR expressions. 

As seen before, the computation of $k_{exp}$ is decomposed into two tasks. The first one determines via the function $\mathcal{F}(a, exp)$ the frequency of each tag $a \in \Sigma$ on the whole expression, in order to derive the maximal frequency. The second task computes via the function $\mathcal{R}(exp)$ the maximal number of consecutive recursive steps in the whole expression. The value $k_{exp}$ is the sum of these two values. Formally: 

$k_{exp} \stackrel{def}{=} \max\{\, \mathcal{F}(a, exp) \mid a \in \Sigma \,\} + \mathcal{R}(exp)$ 



The functions $\mathcal{F}(a, exp)$ and $\mathcal{R}(exp)$ are defined by structural induction in Table 3. When exp is a for/let expression, the value $k_{exp}$ is specified by summing the sub-expression values. This is motivated by the fact that, for instance, for-expressions are usually used to encode nested iterations performed by XPath paths, like in the query for x in /a for y in x/b return y. This leads in some cases to an overestimation of the value $k_{exp}$ that would be actually sufficient for a finite analysis. For instance, for the query q' 

for x in /a/a return for y in /a/b return x,y 

we have $\mathcal{F}(a, q')=3$, while the value 2 would be sufficient. More precision can be obtained by tracing variable bindings in the definition of $\mathcal{F}(,)$. The same argument holds for $\mathcal{R}()$. However, this would make the formalization cumbersome without being a decisive factor for the analysis. Thus, our choice has been guided by simplicity and conciseness of the $\mathcal{F}(,)$ and $\mathcal{R}()$ definitions. 



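$$\mathcal{F}(a, exp) \stackrel{def}{=} \begin{cases}
0 & \text{if } exp \text{ is } () \text{ or } s \text{, or } exp \text{ is } x/axis::\phi \text{ and } axis \text{ recursive or } \phi \notin \{a, node()\} \\
1 & \text{if } exp \text{ is } x/axis::\phi \text{ and } axis \text{ not recursive and } \phi \in \{a, node()\} \\
\max_i \{ \mathcal{F}(a, exp_i) \} & \text{if } exp \text{ is } (exp_1, exp_2) \text{ or } (\text{if } exp \text{ then } exp_1 \text{ else } exp_2) \\
\sum_i \mathcal{F}(a, exp_i) & \text{if } exp \text{ is } (\text{for/let } x\ exp_1 \text{ return } exp_2) \text{ or } (\text{delete } exp_1) \text{ or } (\text{insert/replace } exp_1\ exp_2) \\
\mathcal{F}(a, exp) & \text{if } exp \text{ is } (\langle b \rangle exp \langle /b \rangle) \text{ or } (\text{rename } exp \text{ as } b) \text{ and } b \neq a \\
1 + \mathcal{F}(a, exp) & \text{if } exp \text{ is } (\langle b \rangle exp \langle /b \rangle) \text{ or } (\text{rename } exp \text{ as } b) \text{ and } b = a
\end{cases}$$ 

$$\mathcal{R}(exp) \stackrel{def}{=} \begin{cases}
0 & \text{if } exp \text{ is } () \text{ or } s \text{ or } x/axis::\phi \text{ and } axis \text{ not recursive} \\
1 & \text{if } exp \text{ is } x/axis::\phi \text{ and } axis \text{ recursive} \\
\max_i \{ \mathcal{R}(exp_i) \} & \text{if } exp \text{ is } (exp_1, exp_2) \text{ or } (\text{if } exp \text{ then } exp_1 \text{ else } exp_2) \\
\sum_i \mathcal{R}(exp_i) & \text{if } exp \text{ is } (\text{for/let } x\ exp_1 \text{ return } exp_2) \text{ or } (\text{delete } exp_1) \text{ or } (\text{insert/replace } exp_1\ exp_2) \text{ or } (\text{rename } exp_1 \text{ as } b)
\end{cases}$$ 

Table 3: $\mathcal{F}(,)$ and $\mathcal{R}()$ definition. 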



The rule for element construction deserves some comment. Note that tags of constructed elements are taken into account. Indeed, these elements can be inserted by an update as children of existing elements, thus generating new chains that can be used by a query. Consider the recursive schema {a ← b, b ← b?, c?} and the following update u: 

for x in /a/b return 

insert <b><b><c/></b></b> into x 

As already outlined, precision of the independence analysis relies (among other things) on the chains generated for element construction. The rules in Table 3 lead to $k_u=3$, and thus the chain a.b:b.b.c is inferred for the finite analysis. Note that tag frequency for rename expressions is determined in a similar way: after renaming, the tag frequency may increase, and chains, for the finite analysis, have to capture this change. Other rules are self-explanatory. 
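
The structural induction of Table 3 can be mimicked with a small Python sketch over a toy expression tree. Everything about the encoding is an assumption of this example: the class names are hypothetical, bare variable references are modeled as leaves contributing 0, a multi-step path is encoded as nested iterations (sum rule), and element construction is given the same pass-through treatment in R as rename.

```python
from dataclasses import dataclass

RECURSIVE = {"descendant", "descendant-or-self", "ancestor", "ancestor-or-self"}


@dataclass(frozen=True)
class Leaf:        # (), a string s, or a bare variable (assumed to count 0)
    pass


@dataclass(frozen=True)
class Step:        # x/axis::test
    axis: str
    test: str


@dataclass(frozen=True)
class Choice:      # (exp1, exp2) or if-then-else: max rule
    parts: tuple


@dataclass(frozen=True)
class Nested:      # for/let, delete, insert/replace: sum rule
    parts: tuple


@dataclass(frozen=True)
class Tagged:      # <b>exp</b> or rename exp as b
    tag: str
    body: object


def F(a, e):
    """Frequency of tag a in expression e."""
    if isinstance(e, Leaf):
        return 0
    if isinstance(e, Step):
        return int(e.axis not in RECURSIVE and e.test in (a, "node()", "*"))
    if isinstance(e, Choice):
        return max(F(a, p) for p in e.parts)
    if isinstance(e, Nested):
        return sum(F(a, p) for p in e.parts)
    if isinstance(e, Tagged):
        return F(a, e.body) + int(e.tag == a)
    raise TypeError(e)


def R(e):
    """Number of recursive steps accumulated in expression e."""
    if isinstance(e, Leaf):
        return 0
    if isinstance(e, Step):
        return int(e.axis in RECURSIVE)
    if isinstance(e, Choice):
        return max(R(p) for p in e.parts)
    if isinstance(e, Nested):
        return sum(R(p) for p in e.parts)
    if isinstance(e, Tagged):
        return R(e.body)
    raise TypeError(e)


def k_value(e, alphabet):
    return max(F(a, e) for a in alphabet) + R(e)


def path(*tests):
    """Encode /t1/t2/... as nested iterations over child steps."""
    return Nested(tuple(Step("child", t) for t in tests))


# for x in /a/b return insert <b><b><c/></b></b> into x
u = Nested((path("a", "b"),
            Nested((Tagged("b", Tagged("b", Tagged("c", Leaf()))), Leaf()))))
print(k_value(u, {"a", "b", "c"}))
```

Grouping the sum-rule and max-rule constructs into two generic node types keeps the sketch short; a faithful implementation would presumably keep one node type per syntactic construct.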

Finite independence analysis. We see now how to use the values $k_q$ and $k_u$ in order to determine a k value such that $C_d$-independence can be detected by restricting to k-chains. 

Consider q=/descendant::b and u=delete /descendant::c over the previous DTD $d_1$. They are dependent and $k_q=1$, $k_u=1$. We could argue that a sound choice is $k=\max(k_q, k_u)$, which allows the finite analysis to infer the query chain r.a.b and the update chain r.a:c. Unfortunately, these chains do not conflict, so the analysis would wrongly rule out dependence. The problem here comes from the fact that the update may change a descendant of a query returned node, and that $k=\max(k_q, k_u)$ does not permit capturing this in the finite analysis. To avoid this problem, it is necessary that representative chains that are inferred for the update cover query returned nodes. To this end, while inferring chains for the update u, structural properties of the query q also have to be taken into account. This is obtained by setting k to $k_q + k_u$. 
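
The example above can be replayed with a short Python sketch. It assumes the recursive DTD is abstracted as a child map, approximates the chains inferred for /descendant::b and delete /descendant::c by all rooted k-chains ending in b and c respectively, and ignores the ':' separator when comparing chains; all names are hypothetical.

```python
# Assumed child map for the DTD: r <- a, a <- (b,c,e)*, b,c,e <- f, f <- a,g
D1 = {"r": ["a"], "a": ["b", "c", "e"], "b": ["f"], "c": ["f"],
      "e": ["f"], "f": ["a", "g"], "g": []}


def k_chains_to(tag, k, chain=("r",)):
    """Yield rooted k-chains of D1 that end in `tag`."""
    if chain[-1] == tag:
        yield chain
    for child in D1[chain[-1]]:
        if chain.count(child) < k:
            yield from k_chains_to(tag, k, chain + (child,))


def is_prefix(c1, c2):
    return c2[:len(c1)] == c1


def conflict_found(k):
    """Look for a conflict between return chains of q and update chains of u."""
    query = set(k_chains_to("b", k))    # q = /descendant::b
    update = set(k_chains_to("c", k))   # u = delete /descendant::c
    return any(is_prefix(q, u) or is_prefix(u, q)
               for q in query for u in update)


print(conflict_found(1))   # k = max(k_q, k_u)
print(conflict_found(2))   # k = k_q + k_u
```

With k=2 the update chain r.a.b.f.a.c becomes available, and the return chain r.a.b is one of its prefixes, which is exactly the situation the max-based choice fails to expose.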

In the remaining part of this section we focus on one of our main 
results, soundness of the finite analysis. 

Theorem 5.1 (Soundness of $C_d^k$-Independence). Let d be a DTD, q a query and u an update. Let $k = k_q + k_u$ as defined above. Then: 

$q \perp_{C_d^k} u$ implies $q \perp\!\!\!\perp_d u$ 

We focus on soundness because completeness ($q \perp_{C_d} u$ implies $q \perp_{C_d^k} u$) is straightforward, as $C_d^k \subseteq C_d$. 

We next develop the main steps of the proof of Theorem 5.1. We reason in terms of dependence, rather than independence. We prove that $C_d$-dependence implies $C_d^k$-dependence (these notions directly follow from Definition 4.1) by showing that from any pair of chains in $C_d$, witness of dependence, it is possible to identify a pair of k-chains in $C_d^k$, witness of dependence. 

Figure 2: CDAG for q1, q2 

The proof is composed of three steps. First, we show that there exists a folding from query chains to k-query-chains, for any query q (Lemma 5.2). Then, we show that there exists a folding from query chains to k-query-chains also preserving the prefix relation $\preceq$, for any pair of queries (q, q') (Lemma 5.3). Finally, we show that such a folding exists for chains inferred for any query-update pair (q, u) (Theorem 5.1). Proofs are reported in [9]. 

Given a DTD d, we define a folding relation $\hookrightarrow_d \subseteq C_d \times C_d$ as 

$\hookrightarrow_d \stackrel{def}{=} \{\, (c_1, c_2) \mid c_1 = c.a.c'.a.c'' \wedge c_2 = c.a.c'' \,\}$ 

Notice that, above, the symbol a is a recursive type of the schema. We denote by $\hookrightarrow_d^*$ the reflexive and transitive closure of $\hookrightarrow_d$. 

LEMMA 5.2 (FOLDING). For each chain c inferred for a query q there exists a chain c' inferred for q such that $c \hookrightarrow_d^* c'$ and c' is a $k_q$-chain. 
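
A folding step simply removes the segment between two occurrences of a repeated tag. The following Python sketch applies such steps until the chain becomes a k-chain. It is only an illustration under stated assumptions: chains are tuples of tags, the choice of which occurrences to fold is arbitrary, and the sketch does not check that the folded chain is still inferred for the query, which is the substantial part of Lemma 5.2.

```python
from collections import Counter


def fold_once(chain, tag):
    """Rewrite c.a.c'.a.c'' into c.a.c'' using the first two occurrences of `tag`."""
    i = chain.index(tag)
    j = chain.index(tag, i + 1)
    return chain[:i] + chain[j:]


def fold(chain, k):
    """Apply folding steps until every tag occurs at most k times."""
    chain = tuple(chain)
    while True:
        repeated = [t for t, n in Counter(chain).items() if n > k]
        if not repeated:
            return chain
        chain = fold_once(chain, repeated[0])


c = ("r", "a", "b", "f", "a", "c", "f", "a", "e")
print(fold(c, 2))
print(fold(c, 1))
```

Each step keeps the chain valid with respect to the schema, since the suffix c'' already followed an occurrence of a in the original chain, and it preserves the last symbol of the chain.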

When q and u are $C_d$-dependent, at least one of confl(U, v), confl(r, U), confl(U, r) is nonempty (see Definition 4.1). This implies that there exists a conflicting pair of inferred chains, witness of the $C_d$-dependence of q and u. As updates are defined in terms of queries, the next lemma, which focuses on "conflicting" query chains, is needed to conclude the proof of Theorem 5.1. 

Lemma 5.3 (Folding and Conflict Preservation). For each pair of chains $(c_1, c_2)$ inferred for queries $q_1, q_2$, and such that $c_1 \preceq c_2$, there exists $(c'_1, c'_2)$ inferred for $q_1, q_2$, such that $c'_1 \preceq c'_2$ and $c_i \hookrightarrow_d^* c'_i$ with $c'_i$ a $(k_{q_1}+k_{q_2})$-chain (i=1, 2). 

6. IMPLEMENTATION AND EXPERIMENTS 

6.1 Complexity and Implementation 

We implemented our technique for independence analysis in Java. The crucial aspect of the implementation concerns the choice of the data structure for representing inferred chains for the query and the update: the overall performance of the analysis depends on this. This is because the number of distinct chains inferred for a single expression can grow exponentially with the size of the expression to analyze (see footnote 8). 

In order to avoid this blow-up we represent a set of inferred chains for the query (or update) as a chain-DAG (CDAG) where common prefixes and suffixes shared by different chains are merged according to the following principle. A CDAG is rooted at the schema root type, contains no self-loops and meets the following property: for every type a defined in d, there is at most one CDAG-node of type a at a distance h from the root. In other words, if two chains happen to have the same type-name in position h, they will share a common node in the CDAG at depth h. This implies that during chain inference the width of the CDAG is upper-bounded by the schema size. 

8 This happens for schemas that make heavy use of recursive definitions, but also for non-recursive ones, like for instance $d = \{\, a_i \leftarrow (b_i, c_i)^*,\ b_i, c_i \leftarrow a_{i+1},\ i = 1..n \,\}$ (for a query $q = //a_n$ the number of inferred chains is $2^n$). 

The CDAG representation requires a small overhead in order to distinguish two chains inferred for distinct sub-expressions, in the case that these chains share some nodes of the graph. To this end, any edge connecting two CDAG-nodes is labeled with a code identifying the query/update expressions that created it during chain inference. These identifiers are necessary to correctly perform chain inference for backward axes, and independence checking as well. For instance, consider the query q1, q2 = //c/e, /a/d/c/f. Assume q1 and q2 produce {a.b.c.e, a.d.c.e} and {a.d.c.f}. The CDAG representation of these chains is illustrated in Figure 2. First, we clearly see that the merge does not produce non-existing chains as artifacts: by following query codes there is no way to trace a chain a.b.c.f. Second, we can observe that if in q1, q2 we had q2 = /a/d/c/f/ancestor::*, when inferring chains for the last step of q2, thanks to edge-labeling we avoid navigating upward through parts of the CDAG that have not been generated by q2 (i.e., the b node). Notice that backtracking on unvisited nodes would not affect the correctness of the analysis, but would compromise precision of the independence analysis. 
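
The CDAG idea can be sketched in Python as follows. This is a simplified model written for the q1, q2 example only, not the authors' Java data structure: nodes are (depth, type) pairs, each edge carries the set of expression codes that created it, and a per-expression index of end nodes stands in for the auxiliary index on chain ending points. The class and method names are hypothetical.

```python
from collections import defaultdict


class CDAG:
    def __init__(self):
        # (depth, type) -> {(depth + 1, type): set of expression codes}
        self.edges = defaultdict(lambda: defaultdict(set))
        # expression code -> end nodes of its inferred chains
        self.ends = defaultdict(set)

    def add_chain(self, chain, code):
        """Merge a chain into the graph, sharing nodes by (depth, type)."""
        for depth in range(len(chain) - 1):
            src = (depth, chain[depth])
            dst = (depth + 1, chain[depth + 1])
            self.edges[src][dst].add(code)
        self.ends[code].add((len(chain) - 1, chain[-1]))

    def nodes(self):
        found = set(self.edges)
        for targets in self.edges.values():
            found.update(targets)
        for end_nodes in self.ends.values():
            found.update(end_nodes)
        return found

    def chains(self, root, code):
        """Trace the chains of one expression by following only its edge codes."""
        out = []

        def walk(node, path):
            if node in self.ends[code]:
                out.append(tuple(path))
            for nxt, codes in self.edges.get(node, {}).items():
                if code in codes:
                    walk(nxt, path + [nxt[1]])

        walk((0, root), [root])
        return sorted(out)


g = CDAG()
for chain in (("a", "b", "c", "e"), ("a", "d", "c", "e")):
    g.add_chain(chain, "q1")
g.add_chain(("a", "d", "c", "f"), "q2")
print(g.chains("a", "q2"))
```

In this example the three chains share the single c node at depth 2, and tracing with the q2 code cannot reach a.b.c.f because the edge from a to b carries only the q1 code.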

An auxiliary index associates each expression with nodes representing ending points of inferred used and return chains (e and f nodes in Figure 2). Concerning element chains, these are kept in a specific/separate component of the CDAG, in order to distinguish among chains those typing input and those typing constructed data. 

The following theorem states that the chain inference needed for checking $C_d^k$-independence has polynomial complexity. 

THEOREM 6.1. By using CDAGs, finite chain inference for a DTD d, a value k, and an expression exp, can be done in $O(k^2 \times |d|^4)$ space and $O(|exp| \times k^2 \times |d|^5)$ time. 

For space reasons, the proof is reported in the full version [9]. Here we discuss some cases of practical relevance for which complexity is better than that stated in Theorem 6.1. 

When the test condition node() is not used in XPath steps, then time complexity is $O(|exp| \times k^2 \times |d|^4)$. This is because each inference step would produce $O(k \times |d|)$ nodes in the CDAG (i.e., at most one for each CDAG level), while with node() it produces $O(k \times |d|^2)$. Furthermore, if we assume that during chain inference each XPath step can have at most m CDAG nodes as input, time complexity goes down to $O(m \times |exp| \times k \times |d|^3)$. The value m is likely to be close to 1 for most XPath steps used in practice. This holds in particular for XMark and XPathMark expressions. Another fact observable from such expressions is that they employ a small number of recursive navigations, thus making chain inference doable in $O(|d|^3)$ time. 

When d is not recursive, the k value stops being determinant for the analysis since no label repeats twice in any d chain. In this case the number of edges of the CDAG is bounded by the size of the parent-child relation induced by the schema (see footnote 9). Therefore the spatial complexity goes down to $O(|d|^4)$, while time complexity is $O(|exp| \times |d|^2)$. If we also assume the absence of the test filtering node(), time complexity is $O(|exp| \times |d|)$. These restrictions are often met in practice, and in particular by expressions used in our testbed (when the recursive component of the XMark schema is not visited at all by the expression). 

9 If the parent-child relation has more than $|d|(|d| - 1)$ elements the schema is recursive. 

Once chain inference is done for a query q and an update u, independence (Definition 4.1) is checked over the two inferred CDAGs. This check can be done in $O(c \times |q| \times |u|)$ time, where c is the size of the smaller CDAG. 



6.2 Experiments 

We performed extensive experiments by using our Java implementation, in order to measure i) efficiency, ii) precision and iii) scalability of our static analysis. We used two different benchmarks: a first one based on XMark/XPathMark, and a second one, dubbed R-benchmark, that we specifically designed to measure iii). Concerning the first one, we used a superset of the view maintenance benchmark adopted by Benedikt and Cheney in [6]. Our benchmark is composed of a set of 36 views $v_i$, and a set of 31 updates $u_j$. A view is a query belonging to either the XMark query set q1-q20 [20], or to the XPathMark query set A1-A8/B1-B8 [13]; Ai queries only use downward axes, whereas Bi queries use upward and horizontal axes as well. Concerning updates, a first set corresponds to those used in [6]; these are derived from the XPathMark query set A1-A8/B1-B8 and are of the form UAi=delete Ai or UBi=delete Bi. We added a set of 15 updates formed by insert expressions UI1-UI5, rename expressions UN1-UN5, and replace expressions UP1-UP5. These updates have been defined so as to cover all different types of nodes in XMark documents, and in particular those parts defined by mutually recursive types. It is worth remarking that, even if not all of the delete-updates of the testbed preserve the schema (see UA4, UA5, UA6, UA7, UA8, UB1, UB5, UB6, UB7, UB8), the correctness of our technique is still ensured, since no new chain is created by these expressions. As outlined before, our technique is just unaware of new chains built by breaking schema constraints. In light of this, insert, rename and replace update expressions have been chosen in order to preserve document validity. Before performing the tests, XMark and XPathMark expressions have been suitably rewritten into expressions belonging to the XQuery fragment we consider (Section 3), as done in [5]. The rewriting essentially consists of: putting predicate conditions in disjunctive form, removing attribute use, and extracting paths from function calls and arithmetic expressions. Clearly, the rewriting is such that a query and an update are independent if the rewritten query and update are. Due to lack of space, queries and updates are reported in the full version [9]. 

We used the benchmark described above to measure precision and efficiency of our technique. The R-benchmark, in turn, is designed for understanding the impact of recursion on the performance of our analysis. It is formed by schemas and expressions with a massive use of recursion; it is described later on. 

We ran all tests on a desktop 4-core Intel Xeon 2.13 GHz ma- 
chine with 8 GB RAM (the JVM was given 2 GB) running Linux. 
To avoid perturbations coming from system activity, we ran each 
experiment ten times, discarded the best and the worst performance, 
and computed the average of the remaining times. 
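
The measurement protocol just described (ten runs, best and worst discarded, remaining times averaged) can be sketched as follows in Python. The function names and the use of `time.perf_counter` are assumptions of this sketch; the original experiments were run with the authors' Java implementation.

```python
import time


def trimmed_mean(samples):
    """Average the samples after discarding the single best and worst values."""
    if len(samples) < 3:
        raise ValueError("need at least three samples")
    kept = sorted(samples)[1:-1]
    return sum(kept) / len(kept)


def measure(task, runs=10):
    """Run `task` several times and return the trimmed mean of wall-clock times."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        samples.append(time.perf_counter() - start)
    return trimmed_mean(samples)


# Hypothetical workload standing in for one independence check.
print(measure(lambda: sum(range(10_000))))
```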

Runtime on XMark. We measured the time needed by the static analysis to detect independence of each update wrt the whole set of XMark views. The XMark schema is particularly suitable for testing the performance of our technique since the type dependency graph of this schema contains 5 mutually recursive types that form two cliques of size 2 and 3 respectively. We recall that the execution cost depends on the three parameters |d|, |exp| and k. In this testbed we have |d| = 76, and |exp| < 20, while multiplicity values k range from 2 to 6. As observed in Section 6.1, in many cases chain inference is made in $O(|exp| \times |d|)$ time. 



Time values include the time for CDAG inference and comparison, for each pair of expressions. Results are collected in Figure 3.a. It shows that the analysis is quite fast: in the worst case the analysis is performed in less than 40 ms for the whole set of views, while the average cost is around 15 ms. According to complexity results of Section 6.1, inference time is influenced by i) the k values needed by a query-update pair and ii) the number of recursive types of the schema effectively unfolded. We see small changes in inference time values according to the k value (e.g., the pair UB1-UB2). Yet, two expressions having the same k value may have different time costs for chain inference, depending on the effective number of recursive types unfolded by the analysis (e.g., the pair UI3-UP3). 

Running times obtained from the available OCaml implementation (see footnote 10) of the analysis presented in [6] are rather close to ours: the average time for analyzing an update vs all of the views is around 10 ms. It is worth observing that inference time for [6] shows no appreciable fluctuations, while in our case inference time depends on k, hence on the query and update expressions. The analysis presented in [6] has time complexity $O((|d|^2+|q|)^2 + |u|)$, and thus is expected to be faster than our analysis in the presence of recursive schemas. Nevertheless, as shown shortly, our running times remain low enough to ensure high time savings in view maintenance, even when views are defined on relatively small documents. 

Precision on XMark. Independence (Definition 2.4) is undecidable in general [6], so for the purpose of measuring precision, for each update $u_i$ we manually determined independent pairs $(u_i, v_j)$; details are reported in the full version [9] (note that for most pairs in the considered testbed independence is evident, so this process is much less time consuming than one may guess). We then express precision as the percentage of independent pairs that are deemed independent by our static analysis too. To estimate improvements wrt the alternative schema-based technique [6] we computed the same percentages for that technique by using the public tool (see footnote 10). 

Results are reported in Figure 3.b. Our chain-based analysis 
turned out to be precise. Percentages go from 72% to 100%, while 
the average precision is 96%. Also, Figure 3.b shows that the analy- 
sis proposed in [6] (that has an average detection of 49%) is always 
outperformed in terms of precision by our static analysis, and in 
some cases improvements are huge. This happens in particular for 
updates UB1, UB5, UB6, UB8 (employing backward and horizon- 
tal axes). 

For these updates, the over-approximation made by type rules in [6] entails a high number of false negatives. Our chain-based inference is instead precise enough to avoid most of these false negatives. In general, improvements in terms of precision go from 8% (UN4) to 96% (UP1), and the average gain is 46%. In particular, precision of our analysis remains high in the presence of views using upward and horizontal axes (XPathMark queries in the group B). These queries are likely to be among the most expensive ones to re-evaluate after document updating. 

Maintenance time on XMark. We measured time savings obtained by avoiding the re-materialization of views that our analysis deems independent of an update. We used three XQuery engines: Saxon 9.2EE, BaseX 7.0.1 and QizX 4.4. We considered a 1MB XMark document and we scaled to 10MB and 100MB, in order to measure time savings in real scenarios. 

Our test results only take into account query answering time. Full details about engine configurations can be found in [9]. For this experiment only, the JVM was given 4GB of RAM, in order to minimize memory swapping. Results are reported in Figure 3.c. 

10 http://homepages.inf.ed.ac.uk/jcheney/programs 

Figure 3: Test results. Panels: (a) analysis time for each update UA1-UP5, types [6] vs. chains; (b) precision (%) for each update UA1-UP5; (c) view maintenance time (log scale) for Saxon, BaseX and QizX, types [6] vs. chains; (d) chain inference time (log scale) on the R-benchmark for k=|exp|, k=|exp|+5 and k=|exp|+10, over expressions e1, e5, e10 and schemas d1, d3, d5, d10, d20 and auctions. 

As in [6], for each update $u_i$ we measured the time $r_i$ needed for refreshing all the 36 views after the update, and the times $r_i^{type}$ and $r_i^{chain}$ needed to refresh only views that are not deemed independent by the static analysis of [6] and by ours, respectively. In Figure 3.c, for each of the three used engines we report the averages of all refreshing times $r_i$, $r_i^{type}$, $r_i^{chain}$. As a consequence of time efficiency and precision of our static analysis, even for a relatively small document of 1MB, our independence analysis ensures high time savings for all engines: 82% for Saxon, 75% for BaseX and 85% for QizX. The type-based analysis [6] ensures much lower time savings: 36% for Saxon, 31% for BaseX and 37% for QizX. 

These percentages are essentially the same as those obtained for 
10MB and 100MB documents, both for our technique and for that 
of [6]. This is because in the considered benchmark, queries that 
are not statically deemed as independent of an update, and hence 
refreshed, are the most expensive ones to refresh. 
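
The savings percentages can be read as the relative reduction of the average refresh time. The sketch below assumes that a saving is computed as one minus the ratio between the average time for refreshing only the non-independent views and the average time for refreshing all views; this formula is our reading of the text, and the timing values are made up for illustration (they are not measurements from the experiments).

```python
from statistics import mean


def time_saving(full_times, selective_times):
    """Relative saving of selective refresh over full refresh, in [0, 1].

    full_times:      per-update times for refreshing all views
    selective_times: per-update times for refreshing only the views
                     not deemed independent by a static analysis
    """
    return 1.0 - mean(selective_times) / mean(full_times)


# Hypothetical per-update refresh times in milliseconds (NOT from the paper).
r_full = [100.0, 200.0, 300.0]
r_chain = [20.0, 40.0, 60.0]

print(f"saving: {time_saving(r_full, r_chain):.0%}")
```

Averaging before taking the ratio weights expensive updates more heavily than averaging per-update ratios would; which of the two was used in the experiments is not stated in the text.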

Scalability on R-benchmark. The benchmark is composed of
a parametric schema $d_n$ including n fully mutually recursive types
(each of the n types is defined in terms of all the n types), and a
set of XPath expressions $e_m$, each one consisting of m consecutive
descendant::node() steps. Parameters n and m allow us to
range over several configurations and trace the perimeter of
applicability of our technique. We considered five schemas $d_n$ with n
ranging over {1, 3, 5, 10, 20}, and, for each schema, three
expressions $e_m$ with m ranging over {1, 5, 10}. Also, for each
expression $e_m$ we considered k ranging over
$\{|e_m|, |e_m|+5, |e_m|+10\}$.
Observe that $|d_n|=n$ and $|e_m|=m$.
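
To make the benchmark parameters concrete, the following sketch generates one possible schema and expression for given n and m. The type names t1, ..., tn, the starred-union content model and the leading "/" are assumptions made for illustration; the text only states that each type is defined in terms of all n types and that each expression consists of m consecutive descendant::node() steps.

```python
def make_schema(n):
    """DTD text with n types, each defined in terms of all n types.

    Type names t1..tn and the (t1 | ... | tn)* content model are
    hypothetical choices, not taken from the benchmark itself.
    """
    names = [f"t{i}" for i in range(1, n + 1)]
    content = "(" + " | ".join(names) + ")*"
    return "\n".join(f"<!ELEMENT {name} {content}>" for name in names)


def make_expression(m):
    """XPath expression made of m consecutive descendant::node() steps."""
    return "/" + "/".join(["descendant::node()"] * m)


def k_values(m):
    """The three values of k considered for an expression of size m."""
    return [m, m + 5, m + 10]


print(make_schema(3))
print(make_expression(2))
print(k_values(5))
```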

The schema $d_5$ is quite complex: it contains 5 mutually recursive
types. We can see from Figure 3.d that even with such a complex
form of recursion, for $e_5$ and for each $k \in \{5, 10, 15\}$, chain
inference is still fast (inference time is around a tenth of a second).
For schema $d_{10}$, featuring an extremely complex form of
recursion, inference time is around five seconds for $e_5$, while for
$e_{10}$ the time exceeds ten seconds. The same happens for more
complex cases.

These test results show that even for forms of recursion that are
unlikely to occur in practice (like the $d_5$-$e_5$ case), chain
inference is still fast, while it takes more than one second for
extremely complex cases. Figure 3.d also reports test results on chain
inference of expressions $e_m$ over the XMark DTD. As can be seen, if
we make a comparison with the $d_3$ case (recall that the largest
clique has size 3 in the XMark schema), the number of type definitions
(76 in this case) has an impact on inference time, since the query
expressions make massive use of descendant::node() steps.
As already discussed, when such a step is not used, inference time
drastically reduces, as often happens in practice, and in particular
for many XMark/XPathMark expressions.

7. EXTENSIONS 

Queries and updates. While we have considered all update 
operators made available by XQuery Update Facility, the XQuery 
fragment we have considered (the same as the one considered in 
the related approaches [6, 5]) leaves out several query mechanisms. 
These can be handled by means of two possible methods. The first 
one is based on query rewriting. A basic form has been used in [5], 
as well as in our experiments (see Section 6). The second method 
is based on providing new inference rules. The two methods can be 
used together, and are both easy to develop, except for user-defined
recursive functions, whose treatment is beyond the scope of this
work since they introduce Turing completeness.

For space reasons, details about extensions are given in the full
version [9]. Here, we would like to stress that what makes them 
easy to develop is our static concept of C-independence, based on 
the notions of used, return and element nodes (Section 3). These 
are universal and essential notions, in the sense that, for any kind 
of query construct that one could think of adding to the framework, 
analyzing the role of a node with respect to this construct makes 
the node fall in one of these three categories. Thus, generalizing 
our framework for a new query construct mainly consists of iden- 
tifying how used, return and element nodes are determined. This 
simply requires understanding the standards concerning the query 
construct semantics, and reusing principles followed in the treat- 
ment of the core language in Section 2. 

Schemas. Concerning schemas, our technique can be extended 
in order to deal with Extended DTD [14], capturing XML Schema 
and RelaxNG types. 
DEFINITION 7.1. An Extended DTD (EDTD) is specified by a tuple
$(\Sigma, \Sigma', s, d, \mu)$ where $(\Sigma', s, d)$ is a DTD and
$\mu$ is a function from $\Sigma' \cup \{S\}$ to $\Sigma \cup \{S\}$
such that $\mu(S)=S$.

A tree $t$ is valid wrt $(\Sigma, \Sigma', s, d, \mu)$ if and only if
$t=\mu(t')$ for some tree $t'$ that is valid wrt the DTD
$(\Sigma', s, d)$. Following [14], in an EDTD we can assume
$\Sigma'=\{a_i \mid a \in \Sigma\}$ and $\mu(a_i)=a$ for all
$a_i \in \Sigma'$. This implies that two differently indexed types
produce the same label but possibly different content models. That
said, it is sufficient to change Definition 2.1 and the definition of
Tc(c, a) with $a \in \Sigma$. Both changes are straightforward. Notice
that the precision of the inference as well as the complexity results
remain unchanged for the EDTD case.
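
As an illustration of Definition 7.1, the sketch below checks EDTD validity on small trees by searching for a typing of the nodes that is valid wrt the underlying DTD. It is a brute-force reading of the definition, not the paper's inference algorithm. The tree encoding, the example types r1, a1, a2, b1 and the encoding of content models as Python regular expressions over space-terminated type names are all assumptions; the string type S is ignored for brevity.

```python
import re
from itertools import product


def possible_types(tree, mu, d):
    """Return the types in Sigma' that can be assigned to the root of `tree`.

    tree: (label, [children]), with labels taken from Sigma
    mu:   dict mapping each type in Sigma' to its label in Sigma
    d:    dict mapping each type to a regex over child types, where every
          type name in a word is followed by one space
    """
    label, children = tree
    child_options = [possible_types(c, mu, d) for c in children]
    result = set()
    for typ, lab in mu.items():
        if lab != label:
            continue
        # Try every typing of the children (exponential; fine for a sketch).
        for choice in product(*child_options):
            word = "".join(t + " " for t in choice)
            if re.fullmatch(d[typ], word):
                result.add(typ)
                break
    return result


def is_valid(tree, start, mu, d):
    """A tree is valid if its root can be given the start type."""
    return start in possible_types(tree, mu, d)


# Hypothetical EDTD: a1 and a2 share the label "a" but differ in content.
mu = {"r1": "r", "a1": "a", "a2": "a", "b1": "b"}
d = {"r1": "a1 a2 ", "a1": "(b1 )+", "a2": "", "b1": ""}

good = ("r", [("a", [("b", [])]), ("a", [])])
bad = ("r", [("a", []), ("a", [("b", [])])])
print(is_valid(good, "r1", mu, d), is_valid(bad, "r1", mu, d))
```

The two trees contain the same labels, but only the first respects the order imposed by the content model of r1, which a plain DTD over the labels r, a and b could not express.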

Concerning attributes, extensions are straightforward, and actu- 
ally implemented in our prototype (a simple rule for dealing with 
the attribute axis is needed). Concerning ID/IDREF constraints 
in DTDs, and key/keyref constraints in XSDs (studied in [2]), we 
assume they are preserved by updates, as we assume that validity 
is preserved (Section 2). So, in order to ensure precise and sound 
independence analysis, chain inference does not need to consider 
these constraints. Our notion of C-independence only concerns the 
type component of the schema, while these constraints pose restric- 
tions on the values of attributes and elements in a document, and 
do not impact its structure. 

8. RELATED WORK 

Besides [15] and [6], already discussed, another work quite close 
to ours is that recently presented by Benedikt and Cheney in [5]. An 
important contribution of this work was a schema-less framework
that factors the problem of independence analysis into two
sub-problems: i) statically inferring a set of destabilizer queries
from a query, and ii) checking whether destabilizers overlap with the
target nodes of an update. The precision of the technique highly
depends on the kind of destabilizers that are inferred from XPath
steps. In this regard, the inferred destabilizers for steps of the
form x/child::b include x/child::* (a similar inference is made for
other downward axes). As a consequence, any update touching a non-b
node which is a sibling of a b node selected by x/child::b would not
be detected as independent of x/child::b, while it should be. In the
presence of a schema, our technique detects independence for these
cases, thus ensuring a much higher degree of precision. It is worth
observing that the precision of this destabilizer-based approach could
be improved by adopting a different destabilizer inference system;
yet ensuring high precision could be hard since, as shown in
[5], there is no elementary algorithm for constructing a minimal
static destabilizer.
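
The sibling case can be observed on a concrete document. The sketch below is only a dynamic check on one hypothetical input (the document and the update are invented for illustration); it does not implement either static analysis. It shows the behaviour that a precise analysis should predict: deleting non-b siblings leaves the result of child::b unchanged.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: x has two b children and one c child.
doc = ET.fromstring("<x><b>1</b><c>2</c><b>3</b></x>")


def query(root):
    """Evaluate child::b from the root and serialise the selected nodes."""
    return [ET.tostring(b) for b in root.findall("b")]


before = query(doc)

# Update: delete every c child, i.e. the non-b siblings of the b nodes.
for c in doc.findall("c"):
    doc.remove(c)

after = query(doc)
print(before == after)
```

A destabilizer of the form x/child::* overlaps with the deleted c node, so an analysis based on it would conservatively report a possible dependence here.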

Type-based projection techniques [7, 3] could be extended to
detect query-update independence. However, as type projectors
resemble the types inferred by [6], the extension would not be as
precise as our technique. Also, both techniques [7, 3] only consider
DTDs, while chain-based analysis works for EDTDs too.

Raghavachari and Shmueli [18] considered a downward subset 
of XPath, and found fragments for which independence turns out to 
be either a polynomial or an NP-hard problem; schema information 
was not considered. 

9. CONCLUSIONS 

We presented a type system able to statically detect XML
query-update independence. One of the main features of the type system
is the chain inference component, which infers the information on
which a highly precise analysis is based. One of the key contributions
of the work is a method to restrict the analysis to a finite set of
chains in the presence of recursive schemas. As shown by examples and
experiments, our technique ensures substantial improvements in terms
of precision wrt the state-of-the-art schema-based technique [6].